Python Tika parsing order issue

Gourang Gaurav Tue, 12 May 2020 09:58:18 -0700
Hi,
I was using the following to get the data from a PDF:

    from tika import parser
    parser.from_file(file_path, xmlContent=True)

The location of the text segment changes compared to the original PDF page. (In my case, a table at the beginning of the page comes out at the bottom.)
So I tried a custom config file with the "sortByPosition" property, in the following way:

    data = parser.from_file(file_path, xmlContent=True,
                            config_path='/path/to/tika_config.xml')

with this tika_config.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <!-- Default Parser for most things, except for 2 mime types, and never
             use the Executable Parser -->
        <parser class="org.apache.tika.parser.DefaultParser">
          <mime-exclude>image/jpeg</mime-exclude>
          <mime-exclude>application/pdf</mime-exclude>
          <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
        </parser>
      </parsers>
      <property name="sortByPosition" value="true"/>
    </properties>

But it is not working. Could you please check and let me know what is wrong, or how it should be done?
Thanks & Regards,
Gourang Gaurav
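
For comparison with the config quoted above: Tika's parser configuration normally attaches options to a specific parser as <param> entries, not as a top-level <property>. The sketch below is a minimal illustration of that layout, assuming a Tika 1.x server and that the PDF parser accepts a boolean "sortByPosition" parameter. The helper name write_config and the constant TIKA_CONFIG are hypothetical, introduced only for this example, and the layout has not been verified against the poster's setup.

```python
import xml.etree.ElementTree as ET

# Assumption: Tika 1.x style config, where parser options are <param>
# children of the specific <parser> element, not a top-level <property>.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Use the default parsers for everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Configure the PDF parser explicitly -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_config(path):
    """Write the sketch config to `path` after checking it is well-formed XML."""
    # Hypothetical helper: fromstring raises ParseError if the XML is malformed.
    ET.fromstring(TIKA_CONFIG.encode("utf-8"))
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(TIKA_CONFIG)
    return path
```

The written file would then be passed as in the original call, for example parser.from_file(file_path, xmlContent=True, config_path=write_config('/path/to/tika_config.xml')). As far as I know, the Python client only applies config_path when it starts the Tika server itself, so a server that is already running would probably need a restart before the new config takes effect.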
