Ben Ranker <bran...@...> writes:
> One of our developers has dealt with the problem of stripping � from PDF
> output before. Unfortunately she’s on vacation at the moment, so I can’t
> just bop over to her cube and ask her. I *think* she wrote a special
> dissemination for the purpose of getting the PDF text and stripping out
> unwanted characters, but I’m not certain of that. I’ll send her an email, in
> any case, and see if she can offer some advice when she gets back.
The latest version of the Muradora source has a method, stripNonValidXMLCharacters(), in org.apache.solr.handler.TransformerToText that should do the trick. In that same class, we also had to modify the getTextFromPDF method to call it, to prevent issues when indexing. Previously it was called only by the code that extracts text from Word documents.
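
For anyone who cannot get at the Muradora source, here is a minimal sketch of what such a filter might look like. It is not the Muradora code: the class name XmlTextCleaner and both method names are made up for illustration, and the only thing it relies on is the character range the XML 1.0 specification allows (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF).

```java
// Hypothetical helper, not the Muradora implementation.
final class XmlTextCleaner {

    // True if the code point is permitted by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every code point that XML 1.0
    // does not allow removed. Null or empty input yields an empty string.
    static String stripNonValidXmlCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins surrogate pairs; a lone surrogate comes back
            // as its own value (U+D800..U+DFFF) and is rejected below.
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The idea would be to run the extracted PDF text through stripNonValidXmlCharacters() before it is handed to the indexer, the same way the Word text already is. One caveat: U+FFFD (the replacement character) is itself legal in XML 1.0, so a filter based purely on that range keeps it. If the unwanted character is a literal U+FFFD, it would have to be dropped explicitly.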
