Please upload your file to a file-sharing service, and describe what you
expected and what you got instead, ideally for one specific line that
you think is botched. Compare it with the text that Adobe Reader extracts.
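
Independent of the PDF itself, one thing in the snippet quoted below may explain part of the garbling: calling str() on the bytes returned by encode() yields the printable representation (b'...'), in which every non-ASCII byte becomes an escape such as \xc5, and the later replace("\\", "") then strips the backslashes. A minimal sketch of that effect and of a possible alternative follows, assuming parser.from_file returns a dict with a "content" key as tika-python does. The sample dict and both function names are made up for illustration, and this does not rule out a problem in the PDF's own fonts.

```python
def mangle_like_original(parsed):
    """Reproduce the steps of the quoted snippet on a parsed-result dict."""
    as_bytes = str(parsed).encode("utf-8", errors="ignore")
    # str() of a bytes object gives its repr, e.g. b'Za\xc5\xbc...',
    # so the Polish letters are already lost as readable text here.
    return str(as_bytes).replace("\n", "").replace("\\", "")


def extract_text(parsed):
    """Take the extracted text as a str and only normalise whitespace."""
    text = parsed.get("content") or ""  # content may be None for empty output
    return " ".join(text.split())


# Hypothetical stand-in for the result of parser.from_file(path).
sample = {"metadata": {}, "content": "Zażółć gęślą jaźń\n"}

mangled = mangle_like_original(sample)
clean = extract_text(sample)
```

With the real file, the same idea would be extract_text(parser.from_file(path)), printed directly without any encode() step. If the text is still wrong at that point, the comparison with Adobe Reader becomes the interesting part.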
Tilman
On 16.12.2020 at 18:21, Chris Mattmann wrote:
Copying the Tika dev list where I think you will find the help you are looking
for 😊
From: Mariusz G <[email protected]>
Date: Wednesday, December 16, 2020 at 7:04 AM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Tika - problem with Polish encoding
Hello Sir,
I'm writing to you because I have tried everything, without success.
When I use Tika with Polish PDF documents, the Polish characters are not
encoded properly.
This is my code:
from tika import parser
raw = parser.from_file("/Users/mgrub/Downloads/NLP/PCC_Rokita_2019.pdf")
raw = str(raw)
safe_text = raw.encode('UTF-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )
I've tried several different encoding standards (ISO-8859, ISO-8859-2,
Windows-1250, CP852) but with no success.
If you can help me I will be grateful, because I don't know who can help better
than you.
Regards,
Mariusz Grubba