FW: [EXTERNAL] Tika - problem with Polish encoding

Chris Mattmann Wed, 16 Dec 2020 09:21:52 -0800

Copying the Tika dev list where I think you will find the help you are looking 
for 😊

From: Mariusz G <[email protected]>
Date: Wednesday, December 16, 2020 at 7:04 AM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Tika - problem with Polish encoding

Hello Sir, 

I'm writing to you because I have tried everything without success.

When I use Tika with Polish PDF documents, the Polish characters are not 
encoded properly.

This is my code:

from tika import parser
raw = parser.from_file("/Users/mgrub/Downloads/NLP/PCC_Rokita_2019.pdf")
raw = str(raw)
safe_text = raw.encode('UTF-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

I've tried several different encoding standards (ISO-8859, ISO-8859-2, 
Windows-1250, CP852) but with no success.
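
One possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: in Python 3, calling str() on a bytes object returns its repr (b'...'), so every non-ASCII byte is written as a \xNN escape, and removing the backslashes afterwards leaves fragments such as xc5x82 in place of the Polish letters. The example below reproduces that on a hypothetical stand-in for the parser result (the sample dict and its "content" key are assumptions made for illustration, and no Tika call is made), then reads the text directly as a str instead.

```python
# Hypothetical stand-in for the value returned by parser.from_file();
# the "content" key is an assumption about its shape, used here for illustration.
parsed = {"content": "Zażółć gęślą jaźń\n", "metadata": {}}

# The approach from the code above: str() of the whole dict, encode to bytes,
# then str() of those bytes. str(bytes) returns the repr, so each non-ASCII
# byte is rendered as a \xNN escape sequence.
as_bytes = str(parsed).encode("utf-8", errors="ignore")
mangled = str(as_bytes).replace("\n", "").replace("\\", "")

# Alternative: keep the text as a Python 3 str (already Unicode) and only
# normalize whitespace, without any encode step.
text = parsed.get("content") or ""
clean = " ".join(text.split())

print("--- mangled ---")
print(mangled)
print("--- clean ---")
print(clean)
```

If this reading is right, changing the target encoding would not help, because the damage comes from the str() call on the bytes object and not from the choice of codec.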

If you can help me I will be grateful, because I don't know who can help better 
than you.

Regards,

Mariusz Grubba
