Hannah, I am forwarding your question upstream to the dev@tika list. I think
what you need is for them to look at your config file, which I've pasted
below, and check whether it looks OK. Then, in Tika Python, you either need to
give it this config file before your server starts up or, outside of Python,
just start your server with this config file; once that is working, Tika
Python will pick it up:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Exclude default values -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <!-- <property-exclude name = "sortByPosition"/> -->
            <mime-exclude>application/pdf</mime-exclude>
        </parser>
        <!-- Ensure that sorts by position -->
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>application/pdf</mime>
            <property name="sortByPosition" value="true"/>
        </parser>
    </parsers>
</properties>
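
For the list's reference, here is a rough sketch of how I would expect the same intent to be written. It is not a tested config. It assumes that sortByPosition is a setting on org.apache.tika.parser.pdf.PDFParser rather than on EmptyParser, and that your Tika version accepts the params/param syntax for parser settings. Both assumptions should be confirmed against the Tika version you are running.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Keep the default parsers, but leave PDF handling to the entry below -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <!-- Assumption: sortByPosition is set on the PDF parser itself -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="sortByPosition" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```

If the server is started by hand, the config file is normally passed with the server's --config option, so that Tika Python talks to a server that is already using it.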

 

 

Cheers,
Chris

From: Hannah Eli <elihann...@gmail.com>
Date: Wednesday, May 26, 2021 at 1:47 PM
To: "Mattmann, Chris A (US 1740)" <chris.a.mattm...@jpl.nasa.gov>
Subject: [EXTERNAL] Question on custom tika-python configs for OMB PDF

 

Hi Chris,  

 

Hope you're well. I'm trying to use Tika to parse the table of contents for the
Office of Management and Budget's A-11 Circular PDF (I know it's short enough
to parse manually, but we're building a repeatable extract). When I do so, the
text is parsed out of order. I was trying to fix this by creating a custom
config file with the sortByPosition property (see attached), but I'm not an XML
guru and don't believe it's working properly. I've also tried changing the
Windows environment variables to point to this file.

 

Any guidance would be much appreciated. 

 

Thank you!

Hannah

 

-- 

Hannah Eli 
