Re: [EXTERNAL] Tika - Issues extracting Arabic script

Chris Mattmann Tue, 24 Nov 2020 10:18:14 -0800

Christian, thank you for reaching out. I am copying [email protected] as 
I think your question is best directed there, since tika-python is downstream 
of the processing that happens there.

Best of luck!

Cheers

Chris

From: Christian Faggionato <[email protected]>
Date: Tuesday, November 24, 2020 at 10:10 AM
To: "Mattmann, Chris A (US 1740)" <[email protected]>
Subject: [EXTERNAL] Tika - Issues extracting Arabic script

Dear Chris, 

I am Christian Faggionato, research fellow at the School of Oriental and 
African Studies, University of London. At the moment I’m working on building a 
corpus of Uyghur texts, and some of the content comes from PDF files. I 
wrote a short Python script to extract the text from the PDFs using 
tika-python. The texts are written in Arabic script, and the output looks 
good, but there is one major problem: many spaces between words are missing, 
and I do not know how to address this. Do you have any suggestions in this 
regard? 
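
For readers without the attachment, here is a minimal sketch of what such a 
script might look like. It is not the attached script. It assumes 
tika-python's parser.from_file call with its default settings; the file name 
sample.pdf and the helper average_token_length are hypothetical, added only 
to put a rough number on the missing-space symptom (an unusually high value 
suggests that words are being run together).

```python
def extract_text(pdf_path):
    """Return the plain text Tika extracts from a PDF (empty string if none)."""
    # Imported lazily so the helper below can be used without tika installed.
    # tika-python starts (or contacts) a Tika server, which requires Java.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # 'content' can be None when nothing was extracted.
    return parsed.get("content") or ""


def average_token_length(text):
    """Mean length of whitespace-separated tokens; 0.0 for empty text.

    Hypothetical diagnostic, not part of tika-python.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


# Example usage (hypothetical file name):
#   text = extract_text("sample.pdf")
#   print(average_token_length(text))
```

Comparing this figure for the extracted text against a cleanly typed sample 
of the same language would be one rough way to confirm how many spaces are 
being lost.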

I am attaching a PDF file and the script I wrote in case you would like to 
check them. Thanks in advance for your help, 

Best

Christian.

-- 

PhD, Post-Doctoral Fellow

Department of Religions and Philosophies

Room 339

SOAS University of London
Thornhaugh Street

London, WC1H 0XG

[email protected]
