Hannah, I am forwarding your question upstream to the dev@tika list. I think what you need is for them to look at your config file, which I've pasted below, and confirm that it looks OK. Then, in Tika Python, you need to give it this config file before your server starts up. Alternatively, start the server yourself outside of Python with this config file, and Tika Python will pick it up:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude default values -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- <property-exclude name = "sortByPosition"/>-->
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Ensure that sorts by position -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <property name="sortByPosition" value="true"/>
    </parser>
  </parsers>
</properties>

Cheers,
Chris

From: Hannah Eli <elihann...@gmail.com>
Date: Wednesday, May 26, 2021 at 1:47 PM
To: "Mattmann, Chris A (US 1740)" <chris.a.mattm...@jpl.nasa.gov>
Subject: [EXTERNAL] Question on custom tika-python configs for OMB PDF

Hi Chris,

Hope you're well. I'm trying to use tika to parse the table of contents for the Office of Management and Budget's A-11 Circular PDF (I know it's short enough to parse manually, but we're building a repeatable extract). When I do so, the text is parsed out of order. I was trying to fix this by creating a custom config file with the sortbyPosition property (see attached), but I'm not an XML guru and don't believe it's working properly. I've also tried changing the Windows environment variables to point to this file. Any guidance would be much appreciated.

Thank you!
Hannah

--
Hannah Eli
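
P.S. A rough sketch of the two routes described at the top of this message, in case it helps while the list reviews the file. The helper names (pdf_parser_classes, server_command, parse_with_config) and the file paths are made up for illustration. It assumes the Tika server jar accepts --config and --port, and that the installed tika-python version accepts a config_path keyword on parser.from_file; please check both against your versions. The first helper only reports which parser class the config maps to application/pdf, so you can see what the file actually asks Tika to do.

```python
import xml.etree.ElementTree as ET

# The config from this thread, pasted as-is.
CONFIG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <property name="sortByPosition" value="true"/>
    </parser>
  </parsers>
</properties>
"""


def pdf_parser_classes(config_xml):
    """Return the parser classes that the config maps to application/pdf.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(config_xml.encode("utf-8"))
    classes = []
    for parser_el in root.iter("parser"):
        mimes = [(m.text or "").strip() for m in parser_el.findall("mime")]
        if "application/pdf" in mimes:
            classes.append(parser_el.get("class"))
    return classes


def server_command(jar_path, config_path, port=9998):
    """Build a command line for starting the Tika server outside of Python.

    Assumption: the jar accepts --config and --port.
    """
    return ["java", "-jar", jar_path, "--config", config_path, "--port", str(port)]


def parse_with_config(pdf_path, config_path):
    """Let tika-python start the server itself with the given config.

    Assumption: parser.from_file accepts config_path in the installed version.
    The import is local so the helpers above work without tika installed.
    """
    from tika import parser  # third-party: pip install tika
    return parser.from_file(pdf_path, config_path=config_path)


if __name__ == "__main__":
    print(pdf_parser_classes(CONFIG_XML))
    print(" ".join(server_command("tika-server.jar", "tika-config.xml")))
```

For the config above, pdf_parser_classes reports org.apache.tika.parser.EmptyParser as the only class mapped to application/pdf. Since sortByPosition is a PDFParser setting, that mapping may be worth raising with the list.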