Cc’ing PDFBox

On Tue, Nov 24, 2020 at 1:18 PM Chris Mattmann <[email protected]> wrote:
> Christian, thank you for reaching out. I am copying [email protected], as
> I think your question is best directed there, since tika-python is
> downstream of the processing that happens there.
>
> Best of luck!
>
> Cheers
> Chris
>
> From: Christian Faggionato <[email protected]>
> Date: Tuesday, November 24, 2020 at 10:10 AM
> To: "Mattmann, Chris A (US 1740)" <[email protected]>
> Subject: [EXTERNAL] Tika - Issues extracting Arabic script
>
> Dear Chris,
>
> I am Christian Faggionato, research fellow at the School of Oriental and
> African Studies, University of London. At the moment I’m working on
> building a corpus of Uyghur texts, and some of the content comes from
> PDF files. I wrote a short Python script to scrape text from the PDFs using
> tika-python. The texts are written in Arabic script, and the output looks
> good, but there is one major problem: many spaces between words are
> missing, and I really do not know how to address this issue. Do you have
> any suggestions in this regard?
>
> I am attaching a PDF file and the script I wrote in case you would like to
> check them. Thanks in advance for your help,
>
> Best
> Christian.
>
> --
> PhD, Post-Doctoral Fellow
> Department of Religions and Philosophies
> Room 339
> SOAS University of London
> Thornhaugh Street
> London, WC1H 0XG
> [email protected]
