I’m using Fedora 3.2.1 and GSearch 2.2 to index some PDF documents. The documents were recently created with a new version of Acrobat Pro.
In my GSearch index, my defaultUpdateIndexDocXslt calls getDatastreamText() to get the text from my PDF. Whenever GSearch calls that function on my PDF streams, I get a NullPointerException (no backtrace) in my Tomcat catalina.log file, and the resulting Solr XML (which is somehow still valid) simply has no output from getDatastreamText(). There’s no interesting information in my fedoragsearch.log.

I was able to track the problem down to a bug in PDFBox. The bug is still present in 0.7.4 (the most recently released version AFAICT), but it is fixed in the tip of PDFBox trunk. I strongly suspect that it’s the following bug:

https://issues.apache.org/jira/browse/PDFBOX-361

Has anyone else run into this problem? Does anyone have a patch handy? Are there plans to release a version of GSearch incorporating newer versions of PDFBox? Are there plans to release a version of GSearch with official support for Fedora 3.2.1? (GSearch 2.2 advertises only Fedora 3.1 support.)

Thanks in advance. Unless I hear of a solution pretty soon, I’m going to start backporting the referenced PDFBox bugfix to 0.7.2 (the version incorporated in GSearch 2.2). I’ll post a patch when I have one.

-- 
Ben Ranker <[email protected]>
Emory University Libraries
_______________________________________________ Fedora-commons-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
