Ben Ranker <bran...@...> writes:

> One of our developers has dealt with the problem of stripping &#0; from PDF
> output before. Unfortunately she’s on vacation at the moment, so I can’t
> just bop over to her cube and ask her. I *think* she wrote a special
> dissemination for the purpose of getting the PDF text and stripping out
> unwanted characters, but I’m not certain of that. I’ll send her an email, in
> any case, and see if she can offer some advice when she gets back.
> 

The latest version of the Muradora source has a method
stripNonValidXMLCharacters() in org.apache.solr.handler.TransformerToText that
should do the trick. To prevent problems when indexing, we had to modify
getTextFromPDF() in the same class so that it calls this method as well.
Previously it was called only by the code that extracts text from Word
documents.
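
For anyone without the Muradora source at hand, here is a minimal sketch of
what a filter of this kind might look like. It is not the Muradora code: the
class name XmlCharFilter and the helper isValidXmlChar() are made up for
illustration, and the method body is my assumption, based only on the
character ranges that XML 1.0 allows. The actual implementation in
TransformerToText may differ.

```java
// Hypothetical stand-alone sketch; not taken from the Muradora source.
final class XmlCharFilter {
    private XmlCharFilter() {}

    // True if the code point is permitted by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every disallowed code point dropped.
    // A null input yields an empty string (an assumption made for convenience).
    static String stripNonValidXMLCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // Walk by code point so surrogate pairs are handled as one unit;
            // an unpaired surrogate falls in D800-DFFF and is dropped.
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, XmlCharFilter.stripNonValidXMLCharacters("a\0b") returns
"ab", so running extracted PDF text through it before building the index
document would remove the NUL characters that show up as &#0;.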


------------------------------------------------------------------------------
_______________________________________________
Fedora-commons-users mailing list
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
