GSearch currently (version 2.2) includes PDFBox-0.7.2.jar. The problems with these characters presumably started with newer PDF tools that emit them, which would explain why only a smaller part of your PDF documents are affected. The newest PDFBox is 1.1.0, available from http://pdfbox.apache.org/download.html. It will be included in GSearch soon, but you may try it yourself; I would be interested to know whether simply substituting the new jar solves the problem.

Best
Gert

On 14/04/2010, at 23.23, Ben Ranker wrote:

> Quoth Serhiy Polyakov on Wed, Apr 14, 2010 at 03:00:46PM -0400:
>> I use FedoraGSearch to index and have the same problem for about 20%
>> of PDFs in my collection. Log has the following:
>>
>> org.xml.sax.SAXParseException: Character reference "" is an
>> invalid XML character
>>
>> The same for "&#0"
>
> XML 1.0 does not allow that character (or in fact most characters from
> &#0; to &#31;, with a few exceptions) to appear within a document.
> XML 1.1 allows &#1; and up but still does not allow &#0;. I’m afraid I
> don’t have a practical solution handy, but perhaps a suggestion might
> point you in the right direction: If you’re able to convince gsearch
> and solr to communicate using XML 1.1, it should solve the problem
> with the first character.
>
> It won’t solve the &#0; problem, though, since that’s still disallowed
> in XML 1.1. I recommend stripping those out. In fairness, there’s
> probably not a good reason for PDF text to contain any of these
> control characters in the first place, so your efforts may be better
> focused on stripping out the control characters than worrying about
> XML 1.1 at all.
>
> One of our developers has dealt with the problem of stripping &#0;
> from PDF output before. Unfortunately she’s on vacation at the moment,
> so I can’t just bop over to her cube and ask her. I *think* she wrote
> a special dissemination for the purpose of getting the PDF text and
> stripping out unwanted characters, but I’m not certain of that. I’ll
> send her an email, in any case, and see if she can offer some advice
> when she gets back.
>
> --
> Ben Ranker <[email protected]>
> Software Engineer, Sr.
> Emory University Libraries

_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
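
The stripping approach suggested in the quoted message could be sketched roughly as follows. This is only an illustration: XmlCharFilter and its methods are hypothetical names, not part of GSearch, PDFBox or Solr, and where the filter would be hooked in (for example, on the extracted PDF text before it is placed in the index document) is an assumption. The filter keeps only code points permitted by the Char production of XML 1.0 and drops everything else, including NUL and the other C0 control characters.

```java
final class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with every code point that XML 1.0
    // forbids removed. Unpaired surrogates are dropped as well.
    static String stripInvalidXml10(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: text with a NUL and a form feed embedded.
        String extracted = "Page one\u0000 and\f page two";
        System.out.println(stripInvalidXml10(extracted));
    }
}
```

Tab, line feed and carriage return are kept because XML 1.0 allows them; whether dropped characters should instead be replaced with a space (to avoid words running together in the index) is a design choice this sketch leaves open.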
