Thanks for reaching out, Aditya, and for using Tika Python. This issue is 
best solved upstream in [email protected], so I am copying that list
and setting it as the reply-to.

 

The issue likely lies in the PDFBox algorithm. There are PDFBox folks on
this list. They can help you. Hopefully there is a simple config setting
to help out.
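
One setting that may be worth trying while waiting for the PDFBox folks is the parser's sort-by-position option. I have not confirmed that it fixes the vertical text case. Tika Server accepts PDF parser options as request headers prefixed with X-Tika-PDF, and tika-python's parser.from_file takes a headers argument. The sketch below assumes tika-python is installed and a Java runtime is available; the helper names and the file name "report.pdf" are hypothetical.

```python
def build_pdf_headers(sort_by_position=True):
    """Build request headers asking Tika's PDF parser to sort text by position.

    Assumption: the server maps X-Tika-PDFsortByPosition onto the PDF
    parser's sortByPosition setting. Whether this helps with vertically
    aligned text is untested here.
    """
    return {"X-Tika-PDFsortByPosition": "true" if sort_by_position else "false"}


def extract_text(path, sort_by_position=True):
    """Parse a PDF through tika-python and return the extracted text."""
    # Imported lazily so build_pdf_headers can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(path, headers=build_pdf_headers(sort_by_position))
    # "content" can be None when no text is extracted.
    return parsed.get("content") or ""


# Hypothetical usage: compare both settings on the problem file.
# print(extract_text("report.pdf", sort_by_position=True))
# print(extract_text("report.pdf", sort_by_position=False))
```

Comparing the output with the option on and off on the same page should show quickly whether the ordering of the vertical fragments changes.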

 

Cheers,

Chris

 

 

From: Aditya Sardesai <[email protected]>
Date: Thursday, August 27, 2020 at 11:44 PM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] I have some questions about tika-python

 

Greetings Chris,

 

Our project requires parsing PDF files and extracting the text for 
verification. I tried a number of other Python packages, but they all had 
issues recognizing text consistently across the file.

 

The most common issue we faced was text not being dumped in the correct 
sequence. This was until we found Tika. We are very impressed by how it 
preserves the text sequence. It is exactly what we want.

 

However, we're facing an issue with vertically aligned text. There are two 
examples of vertically aligned text which I can show. In one instance the text 
is parsed correctly but not in the other.

 

Ex1.

  

In this example, the word "Values" is read as:

V

al

 

ue

s

 

Ex2.

In this example, the date is parsed correctly as:

2020-07-16 00:30

 

Can you please help us understand whether there are specifics of the Tika 
algorithm that we should be aware of? Any suggestions on how we can better 
use the tool?

Please let me know if I need to connect with any other contributor for this.

 

Looking forward to your valuable comments.

 

 

Regards,


 

Aditya Sardesai

Lead Quality Engineer



[email protected]

Connect with me on: LinkedIn



See Beyond, Rise Above

 
