I’m using Fedora 3.2.1 and GSearch 2.2 to index some PDF documents. The
documents were recently created with a new version of Acrobat Pro.

In my GSearch index, my defaultUpdateIndexDocXslt calls getDatastreamText()
to get the text from my PDF. Whenever GSearch calls that function on my PDF
streams I get a NullPointerException (no stack trace) in my Tomcat
catalina.log file, and the resultant Solr XML (which is valid somehow)
simply has no output from getDatastreamText(). There’s no interesting
information in my fedoragsearch.log.
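
To make the symptom concrete, here is a minimal sketch of how an extraction
call could be wrapped so that a failure keeps its full stack trace instead of
just producing empty text. This is not GSearch's actual code: the class
SafeExtract, its Result type, and the extract() helper are hypothetical names
used only for illustration, and the failing extractor in main() merely
simulates a NullPointerException like the one PDFBox raises.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.function.Supplier;

// Hypothetical helper, not part of GSearch or PDFBox.
public class SafeExtract {

    /** Outcome of one extraction attempt. */
    public static final class Result {
        public final String text;        // extracted text, "" on failure
        public final String stackTrace;  // null when extraction succeeded

        Result(String text, String stackTrace) {
            this.text = text;
            this.stackTrace = stackTrace;
        }
    }

    /**
     * Runs the given extractor. On a runtime failure the text is empty
     * (matching the empty field seen in the index document), but the
     * full stack trace is captured so it can be written to a log.
     */
    public static Result extract(Supplier<String> extractor) {
        try {
            String text = extractor.get();
            return new Result(text == null ? "" : text, null);
        } catch (RuntimeException e) {
            StringWriter sw = new StringWriter();
            e.printStackTrace(new PrintWriter(sw, true));
            return new Result("", sw.toString());
        }
    }

    public static void main(String[] args) {
        // Simulated failure standing in for the PDF text extraction call.
        Result r = extract(() -> {
            String s = null;
            return s.trim(); // throws NullPointerException
        });
        if (r.stackTrace != null) {
            System.err.println("Text extraction failed:");
            System.err.println(r.stackTrace);
        }
    }
}
```

With a wrapper along these lines the index document would still be valid and
still lack the text, but the log would show where the exception came from.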

I was able to track the problem down to a bug in PDFBox. The bug is still
present in 0.7.4 (the most recently released version AFAICT), but it is
fixed in the tip of PDFBox trunk. I strongly suspect that it’s the following
bug:
https://issues.apache.org/jira/browse/PDFBOX-361

Has anyone else run into this problem? Does anyone have a patch handy? Are
there plans to release a version of GSearch incorporating newer versions of
PDFBox? Are there plans to release a version of GSearch with official
support for Fedora 3.2.1? (GSearch 2.2 advertises only Fedora 3.1 support.)

Thanks in advance. Unless I hear a solution pretty soon I’m going to start
backporting the referenced PDFBox bugfix to 0.7.2 (the version incorporated
in GSearch 2.2). I’ll post a patch when I have one.

-- 
Ben Ranker <[email protected]>
Emory University Libraries

_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
