Copying the Tika dev list where I think you will find the help you are looking for 😊
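
In the meantime, one thing in the code quoted below may be worth checking. As far as I can tell, the garbling may come from the Python side rather than from Tika's encoding: parser.from_file() in tika-python returns a dict, and calling str() on a bytes object gives its repr with backslash escapes instead of decoded text. Here is a minimal sketch of an alternative, assuming the extracted text is under the "content" key. The helper name extract_text and the sample dict are made up for illustration, so that the snippet runs without a Tika server.

```python
def extract_text(parsed):
    """Return the document text from a tika-python result dict as a plain str."""
    # tika-python's parser.from_file() returns a dict; the text is under
    # "content" (it may be None when nothing could be extracted).
    content = parsed.get("content") or ""
    # Collapse all whitespace runs, including newlines, into single spaces.
    return " ".join(content.split())


# Stand-in for parser.from_file(...); the sample text is hypothetical.
sample = {"metadata": {}, "content": "Zażółć gęślą\njaźń\n", "status": 200}

text = extract_text(sample)
print(text)

# What the quoted code does instead: str() of a bytes object yields its repr,
# so every non-ASCII character becomes a run of backslash escapes, and
# stripping the backslashes afterwards leaves unreadable residue.
mangled = str(text.encode("utf-8"))
print(mangled)
```

With a real document you would pass the result of parser.from_file(path) to extract_text() and print the returned str directly. A Python 3 str is already Unicode, so no encode() step should be needed just to display it.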
From: Mariusz G <[email protected]>
Date: Wednesday, December 16, 2020 at 7:04 AM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Tika - problem with Polish encoding

Hello Sir,

I'm writing to you because I have tried everything without success. When I use Tika with Polish PDF documents, the Polish text is not encoded properly. This is my code:

from tika import parser

raw = parser.from_file("/Users/mgrub/Downloads/NLP/PCC_Rokita_2019.pdf")
raw = str(raw)
safe_text = raw.encode('UTF-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

I've tried several different encodings (ISO-8859, ISO-8859-2, Windows-1250, CP852), but with no success. If you can help me I will be grateful, because I don't know who could help better than you.

Regards,
Mariusz Grubba
