Christian, thank you for reaching out. I am copying [email protected], as I think your question is best directed there, since tika-python is downstream of the processing that happens there.
Best of luck!

Cheers,
Chris

From: Christian Faggionato <[email protected]>
Date: Tuesday, November 24, 2020 at 10:10 AM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Tika - Issues extracting Arabic script

Dear Chris,

I am Christian Faggionato, a research fellow at the School of Oriental and African Studies, University of London. At the moment I'm working on building a corpus of Uyghur texts, and some of the content comes from PDF files. I wrote a short Python script to scrape text from the PDFs using tika-python. The texts are written in Arabic script, and the output looks good, but there is one major problem: many spaces between words are missing, and I do not know how to address this issue. Do you have any suggestions in this regard?

I am attaching a PDF file and the script I wrote in case you would like to check them.

Thanks in advance for your help.

Best,
Christian

--
PhD, Post-Doctoral Fellow
Department of Religions and Philosophies
Room 339
SOAS University of London
Thornhaugh Street
London, WC1H 0XG
[email protected]
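
For readers following the thread, one way to narrow down the missing-space problem before digging into the extraction itself is to measure it. The sketch below is only an illustration and is not the script attached to the original message. `extract_text` relies on tika-python's `parser.from_file`, which needs the `tika` package and a Java runtime. `suspicious_tokens` and its `max_len` threshold are hypothetical helpers introduced here to flag unusually long tokens that may be several words run together. The default of 20 characters is an arbitrary assumption and would need tuning for Uyghur.

```python
def extract_text(pdf_path):
    """Return the plain text Tika extracts from a PDF, or "" if there is none."""
    # Imported lazily so the helper below works without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    return parsed.get("content") or ""


def suspicious_tokens(text, max_len=20):
    """List whitespace-delimited tokens longer than max_len characters.

    Hypothetical diagnostic: very long tokens are candidates for words
    that were joined because a space was dropped during extraction.
    """
    return [token for token in text.split() if len(token) > max_len]
```

Calling `suspicious_tokens(extract_text("sample.pdf"))`, where `sample.pdf` is a placeholder file name, would give a list of candidate tokens to compare against the page as it appears in a PDF viewer.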
