Quoth Serhiy Polyakov on Wed, Apr 14, 2010 at 03:00:46PM -0400:
> I use FedoraGSearch to index and have the same problem for about 20%
> of PDFs in my collection. Log has the following:
>
> org.xml.sax.SAXParseException: Character reference "" in an
> invalid XML character
>
> The same for "&#0;"
XML 1.0 does not allow the character reference in your log to appear within a document. In fact it disallows most characters from &#0; to &#31;, with a few exceptions (tab, line feed, and carriage return). XML 1.1 allows &#1; and up, as character references, but still does not allow &#0;.

I'm afraid I don't have a practical solution handy, but perhaps a suggestion might point you in the right direction: if you're able to convince gsearch and solr to communicate using XML 1.1, it should solve the problem with the first character. It won't solve the &#0; problem, though, since that's still disallowed in XML 1.1. I recommend stripping those out.

In fairness, there's probably no good reason for PDF text to contain any of these control characters in the first place, so your efforts may be better focused on stripping out the control characters than on worrying about XML 1.1 at all.

One of our developers has dealt with the problem of stripping &#0; from PDF output before. Unfortunately she's on vacation at the moment, so I can't just bop over to her cube and ask her. I *think* she wrote a special dissemination for the purpose of getting the PDF text and stripping out unwanted characters, but I'm not certain of that. I'll send her an email, in any case, and see if she can offer some advice when she gets back.

-- 
Ben Ranker <[email protected]>
Software Engineer, Sr.
Emory University Libraries
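To make the stripping suggestion above a bit more concrete, here is a minimal sketch in Python. It is not the dissemination mentioned above and is not tied to gsearch or solr; the function name `strip_invalid_xml_chars` is made up for illustration, and it assumes you can get at the extracted PDF text as a string before it is serialized to XML. It removes everything outside the XML 1.0 `Char` production, which covers &#0; and the other control characters discussed here.

```python
import re

# Everything XML 1.0 forbids in a document: C0 control characters other
# than tab (0x09), line feed (0x0A), and carriage return (0x0D), plus
# surrogates and the noncharacters U+FFFE and U+FFFF.
_INVALID_XML10 = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 disallows removed.

    Hypothetical helper for illustration; run it on the extracted PDF
    text before that text is escaped and written into an XML document.
    """
    return _INVALID_XML10.sub("", text)


if __name__ == "__main__":
    # Made-up sample: NUL and U+0001 are dropped, tab and newline survive.
    sample = "Title\x00 page\x01 one\tok\n"
    print(repr(strip_invalid_xml_chars(sample)))
```

Dropping the characters outright, rather than replacing them with spaces, is a judgment call; if words in your PDFs turn out to be separated only by control characters, substituting a space instead may index better.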
_______________________________________________ Fedora-commons-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
