GSearch currently (version 2.2) includes PDFBox-0.7.2.jar. The
problems with these characters presumably started with newer PDF tools
that emit them, which would explain why only a smaller part of your
PDF documents is affected. The newest PDFBox is 1.1.0, available from
http://pdfbox.apache.org/download.html
It will be included in GSearch soon, but you may try it yourself. I
would be interested to know whether simply substituting the new jar
solves the problem.
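
If the new jar does not help, the stripping that Ben suggests below could
be done on the extracted text before it reaches the XML parser. Here is a
minimal sketch in Java. It is not GSearch code: the class and method names
are made up for illustration, and where to call it in your own indexing
chain is left open. It keeps only the characters allowed by the XML 1.0
Char production (tab, LF, CR, and U+0020 upward, excluding surrogates,
U+FFFE and U+FFFF).

```java
// Hypothetical helper, not part of GSearch or PDFBox.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every non-XML-1.0 character dropped. */
    static String stripInvalidXml10(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so supplementary characters survive intact;
            // an unpaired surrogate comes back as its own value and is dropped.
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, XmlCharFilter.stripInvalidXml10("a" + (char) 0x18 + "b" + (char) 0 + "c")
gives back "abc", so neither &#24 nor &#0 would end up in the XML that is
sent on for indexing.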

Best
Gert


On 14/04/2010, at 23.23, Ben Ranker wrote:

> Quoth Serhiy Polyakov on Wed, Apr 14, 2010 at 03:00:46PM -0400:
>> I use FedoraGSearch to index and have the same problem for about 20%
>> of PDFs in my collection. Log has the following:
>>
>> org.xml.sax.SAXParseException: Character reference "&#24" is an
>> invalid XML character
>>
>> The same for "&#0"
>
> XML 1.0 does not allow &#24; (or in fact most characters from &#0; to
> &#31;, with a few exceptions) to appear within a document. XML 1.1 allows
> &#1; and up but still does not allow &#0;. I’m afraid I don’t have a
> practical solution handy, but perhaps a suggestion might point you in the
> right direction: If you’re able to convince gsearch and solr to communicate
> using XML 1.1, it should solve your &#24; problem.
>
> It won’t solve the &#0; problem, though, since that’s still disallowed in
> XML 1.1. I recommend stripping those out. In fairness, there’s probably not
> a good reason for PDF text to contain any of these control characters in the
> first place, so your efforts may be better focused on stripping out the
> control characters than worrying about XML 1.1 at all.
>
> One of our developers has dealt with the problem of stripping &#0; from PDF
> output before. Unfortunately she’s on vacation at the moment, so I can’t
> just bop over to her cube and ask her. I *think* she wrote a special
> dissemination for the purpose of getting the PDF text and stripping out
> unwanted characters, but I’m not certain of that. I’ll send her an email, in
> any case, and see if she can offer some advice when she gets back.
>
> -- 
> Ben Ranker <[email protected]>
> Software Engineer, Sr.
> Emory University Libraries


_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
