Thanks for reaching out, Aditya, and for using Tika Python. This issue is best solved upstream on [email protected], so I am copying that list and setting it as the reply-to.
The issue likely lies in the PDFBox algorithm. There are PDFBox folks on this list who can help you. Hopefully there is a simple config setting that helps.

Cheers,
Chris

From: Aditya Sardesai <[email protected]>
Date: Thursday, August 27, 2020 at 11:44 PM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] I have some questions about tika-python

Greetings Chris,

Our project requires parsing PDF files and extracting the text for verification. I tried a number of other Python packages, but they all had trouble recognizing text consistently across the file. The most common issue we faced was text not being dumped in the correct sequence. That was until we found Tika. We are very impressed by how it preserves the text sequence. It is exactly what we want.

However, we're facing an issue with vertically aligned text. I can show two examples of vertically aligned text. In one instance the text is parsed correctly, but not in the other.

Ex1. In this example, the word "values" is read as: V al ue s

Ex2. In this example, the date is parsed correctly as: 2020-07-16 00:30

Can you please help us understand whether there are specifics of the Tika algorithm that we should be aware of? Do you have any suggestions on how we can use the tool better? Please let me know if I need to connect with any other contributor about this.

Looking forward to your valuable comments.

Regards,
__
Aditya Sardesai
Lead Quality Engineer
[email protected]
See Beyond, Rise Above
