Hi Chris - thank you for forwarding the request! Once the team has reviewed it, I'll give it another try.

Thank you,
Hannah

On Wed, May 26, 2021 at 5:24 PM Chris Mattmann <mattm...@apache.org> wrote:

> Hannah, I am pushing your question upstream to the dev@tika list. I think
> what you need is for them to look at your config file, which I’ve
> reattached and pasted below, and then see if it looks ok. Then in Tika
> Python you need to give it this config file before your server starts up,
> or outside of Python just start your server with this config file working;
> then Tika Python will pick it up:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <!-- Exclude default values -->
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <!-- <property-exclude name = "sortByPosition"/>-->
>       <mime-exclude>application/pdf</mime-exclude>
>     </parser>
>     <!-- Ensure that sorts by position -->
>     <parser class="org.apache.tika.parser.EmptyParser">
>       <mime>application/pdf</mime>
>       <property name="sortByPosition" value="true"/>
>     </parser>
>   </parsers>
> </properties>
>
> Cheers,
> Chris
>
> *From: *Hannah Eli <elihann...@gmail.com>
> *Date: *Wednesday, May 26, 2021 at 1:47 PM
> *To: *"Mattmann, Chris A (US 1740)" <chris.a.mattm...@jpl.nasa.gov>
> *Subject: *[EXTERNAL] Question on custom tika-python configs for OMB PDF
>
> Hi Chris,
>
> Hope you're well. I'm trying to use tika to parse the table of contents
> for the Office of Management and Budget's A-11 Circular PDF
> <https://urldefense.us/v3/__https:/www.whitehouse.gov/wp-content/uploads/2018/06/a11_web_toc.pdf__;!!PvBDto6Hs4WbVuu7!aHaS3pr3WwzObTFHgaGkqMCJppTbQKWTCHqYM3RU4jHtF7_QT2I398YFRJBbMCfLWTVf_0yR9A$>
> (I know it's short enough to parse manually, but we're building a
> repeatable extract). When I do so, the text is parsed out of order. I was
> trying to fix this by creating a custom config file with the
> sortByPosition property (see attached), but I'm not an XML guru and don't
> believe it's working properly. I've also tried changing the Windows
> environment variables to point to this file.
>
> Any guidance would be much appreciated.
>
> Thank you!
> Hannah
>
> --
> *Hannah Eli*

--
*Hannah Eli*
(317) 656-1366 | elihann...@gmail.com
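
For comparison with the config quoted above, here is a minimal sketch of one way to generate a tika-config.xml that sets sortByPosition on the PDF parser itself. It rests on some assumptions that the dev@tika list has not confirmed for this setup: that the Tika version in use accepts a `<params>`/`<param>` block on `org.apache.tika.parser.pdf.PDFParser`, and that `<parser-exclude>` is how the PDF parser is left out of `DefaultParser`. The function name `build_tika_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_tika_config(sort_by_position: bool = True) -> str:
    """Return a tika-config.xml string (hypothetical helper, not part of tika-python)."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Assumption: DefaultParser handles everything except the PDF parser.
    default = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    ET.SubElement(
        default, "parser-exclude", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )

    # Assumption: the PDF parser is declared explicitly and takes the
    # sortByPosition setting as a boolean <param>.
    pdf = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )
    params = ET.SubElement(pdf, "params")
    param = ET.SubElement(params, "param", {"name": "sortByPosition", "type": "bool"})
    param.text = "true" if sort_by_position else "false"

    body = ET.tostring(properties, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Print the generated config; redirect to a file to save it.
    print(build_tika_config())
```

The output would then be saved as tika-config.xml and handed to the Tika server at startup, as Chris describes; the exact startup option depends on the Tika server version, so check its help output rather than relying on this sketch.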