Hi,
I was using the following to get the data from a PDF:

    from tika import parser
    parser.from_file(file_path, xmlContent=True)

The location of the text segments changes compared to the original PDF page
(in my case, a table at the beginning of the page comes out at the bottom).
So I tried using a custom config file with the "sortByPosition" property, in
the following way:

    data = parser.from_file(file_path, xmlContent=True,
                            config_path='/path/to/tika_config.xml')

with tika_config.xml containing:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <!-- Default Parser for most things, except for 2 mime types,
             and never use the Executable Parser -->
        <parser class="org.apache.tika.parser.DefaultParser">
          <mime-exclude>image/jpeg</mime-exclude>
          <mime-exclude>application/pdf</mime-exclude>
          <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
        </parser>
      </parsers>
      <property name="sortByPosition" value="true"/>
    </properties>

But it is not working. Could you please check and let me know what is wrong or
how it should be done?

Thanks & Regards,
Gourang Gaurav