Please upload your file to a file-sharing service and describe what you expected and what you got instead, ideally for one specific line that you think is garbled. Also compare the result with the text extracted by Adobe Reader.
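
One possible cause that does not depend on the PDF itself: in the snippet quoted below, str() is applied first to the whole result and then to a bytes object. That yields Python's escaped representation (for example \xc5\xbc for "ż"), and the backslashes are then stripped. The following is only a minimal sketch of that effect, assuming parser.from_file returns a dict whose "content" entry is already a Unicode str. The parsed dict is a hypothetical stand-in so the example runs without a Tika server, and mangled / cleaned are illustrative names, not part of tika.

```python
# Hypothetical stand-in for the dict returned by tika's parser.from_file();
# the real call is not made here, and the sample text is invented.
parsed = {"metadata": {}, "content": "Zażółć gęślą jaźń\n"}


def mangled(parsed):
    """Repeat the steps from the quoted snippet."""
    raw = str(parsed)                                  # repr of the whole dict
    safe_text = raw.encode("UTF-8", errors="ignore")   # now a bytes object
    # str() on bytes gives the b'...' repr with \xNN escapes for non-ASCII
    # bytes; removing the backslashes leaves fragments such as "xc5xbc".
    return str(safe_text).replace("\n", "").replace("\\", "")


def cleaned(parsed):
    """Use the 'content' entry directly and only normalise whitespace."""
    text = parsed.get("content") or ""
    return " ".join(text.split())


print("--- mangled ---")
print(mangled(parsed))
print("--- cleaned ---")
print(cleaned(parsed))
```

If the text from raw["content"] is still wrong when printed this way, the problem is more likely in the PDF (fonts or character maps) than in the Python code, and the comparison with Adobe Reader becomes the useful test.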

Tilman

On 16.12.2020 at 18:21, Chris Mattmann wrote:
Copying the Tika dev list where I think you will find the help you are looking 
for 😊

From: Mariusz G <[email protected]>
Date: Wednesday, December 16, 2020 at 7:04 AM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Tika - problem with Polish encoding

Hello Sir,

I'm writing to you because I have tried everything, without success.

When I use Tika with Polish PDF documents, the Polish text is not encoded 
properly.

This is my code:

from tika import parser
raw = parser.from_file("/Users/mgrub/Downloads/NLP/PCC_Rokita_2019.pdf")
raw = str(raw)
safe_text = raw.encode('UTF-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

I've tried several different encoding standards (ISO-8859, ISO-8859-2, 
Windows-1250, CP852) but with no success.

If you can help me I will be grateful, because I don't know who can help better 
than you.

Regards,

Mariusz Grubba


