Quoth Serhiy Polyakov on Wed, Apr 14, 2010 at 03:00:46PM -0400:
> I use FedoraGSearch to index and have the same problem for about 20%
> of PDFs in my collection. Log has the following:
> 
> org.xml.sax.SAXParseException: Character reference "&#24" is an
> invalid XML character
> 
> The same for "&#0"

XML 1.0 does not allow &#24; (or in fact most characters from &#0; to &#31;,
with a few exceptions) to appear within a document. XML 1.1 allows &#1; and
up but still does not allow &#0;. I’m afraid I don’t have a practical
solution handy, but perhaps a suggestion might point you in the right
direction: If you’re able to convince gsearch and solr to communicate using
XML 1.1, it should solve your &#24; problem.
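
To make the difference between the two versions concrete, here is a rough
Python sketch of the Char productions as I read them in the XML 1.0 and
XML 1.1 specs. The function names are my own invention and have nothing to
do with gsearch or solr; treat this as an illustration, not a reference.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp is allowed by the XML 1.1 Char production.

    Most control characters in this range are only legal when written
    as character references, but &#0; is never allowed.
    """
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Compare the two code points from the log messages, plus two ordinary ones.
for cp in (0, 24, 9, 65):
    print(cp, is_xml10_char(cp), is_xml11_char(cp))
```

So &#24; fails the 1.0 check and passes the 1.1 check, while &#0; fails both.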

It won’t solve the &#0; problem, though, since that’s still disallowed in
XML 1.1. I recommend stripping those out. In fairness, there’s probably not
a good reason for PDF text to contain any of these control characters in the
first place, so your efforts may be better focused on stripping out the
control characters than worrying about XML 1.1 at all.
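
If you go the stripping route, something along these lines might be a
starting point. It is only a sketch in Python, assuming you can get at the
extracted PDF text as a string before it is handed to the indexer; the
function name strip_invalid_xml_chars is hypothetical, and this is not
necessarily how our own code does it.

```python
import re

# Matches any character outside the XML 1.0 Char production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with every character that XML 1.0 forbids removed."""
    return _INVALID_XML10.sub("", text)


# Hypothetical extracted text containing NUL (0) and CAN (24) characters.
sample = "PDF\x00 text\x18 here\n"
cleaned = strip_invalid_xml_chars(sample)
```

Dropping the characters outright is the simplest choice; replacing them with
a space instead may be preferable if they turn out to separate words.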

One of our developers has dealt with the problem of stripping &#0; from PDF
output before. Unfortunately she’s on vacation at the moment, so I can’t
just bop over to her cube and ask her. I *think* she wrote a special
dissemination for the purpose of getting the PDF text and stripping out
unwanted characters, but I’m not certain of that. I’ll send her an email, in
any case, and see if she can offer some advice when she gets back.

-- 
Ben Ranker <[email protected]>
Software Engineer, Sr.
Emory University Libraries

_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users