Hannah, I am forwarding your question upstream to the dev@tika list. I think what
you need is for them to look at your config file, which I've pasted below, and
check whether it looks OK. Then, in Tika Python, you need to supply this config
file before your server starts up. Alternatively, start your server outside of
Python with this config file, and Tika Python will pick it up:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude default values -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- <property-exclude name = "sortByPosition"/>-->
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Ensure that sorts by position -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <property name="sortByPosition" value="true"/>
    </parser>
  </parsers>
</properties>
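For comparison, here is a sketch of how the config might look if the goal is to keep the default parsers for everything except PDF and route PDFs to a parser with sortByPosition turned on. It rests on two assumptions that the list should confirm: that org.apache.tika.parser.pdf.PDFParser is the intended class (EmptyParser always produces an empty document, so it would return no text for PDFs), and that the parameter is set with the params/param syntax from the Tika configuration docs rather than a property element. It is untested against your Tika version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parsers handle every type except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Assumption: PDFParser (not EmptyParser) should handle PDFs,
         with text sorted by position on the page -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If the extracted text still comes out unordered, the first thing to check is whether the server was actually started with this file.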
Cheers,
Chris
From: Hannah Eli <[email protected]>
Date: Wednesday, May 26, 2021 at 1:47 PM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Question on custom tika-python configs for OMB PDF
Hi Chris,
Hope you're well. I'm trying to use Tika to parse the table of contents for the
Office of Management and Budget's A-11 Circular PDF (I know it's short enough
to parse manually, but we're building a repeatable extract). When I do so, the
text is parsed out of order. I was trying to fix this by creating a custom
config file with the sortByPosition property (see attached), but I'm not an XML
guru and don't believe it's working properly. I've also tried changing the
Windows environment variables to point to this file.
Any guidance would be much appreciated.
Thank you!
Hannah
--
Hannah Eli