Re: [EXTERNAL] Tika - Issues extracting Arabic script

Tim Allison Tue, 24 Nov 2020 11:55:24 -0800

Cc’ing PDFBox

On Tue, Nov 24, 2020 at 1:18 PM Chris Mattmann <[email protected]> wrote:


> Christian, thank you for reaching out. I am copying [email protected], as
> I think your question is best directed there, since tika-python is
> downstream of the processing that happens there.
>
>
>
> Best of luck!
>
>
>
> Cheers
>
> Chris
>
>
>
>
>
> From: Christian Faggionato <[email protected]>
> Date: Tuesday, November 24, 2020 at 10:10 AM
> To: "Mattmann, Chris A (US 1740)" <[email protected]>
> Subject: [EXTERNAL] Tika - Issues extracting Arabic script
>
>
>
> Dear Chris,
>
> I am Christian Faggionato, research fellow at the School of Oriental and
> African Studies, University of London. At the moment I’m working on
> building a corpus of Uyghur texts, and some of the content comes from
> PDF files. I wrote a short Python script to extract text from PDFs using
> tika-python. The texts are in Arabic script, and the output looks good, but
> there is one major problem: many spaces between words are missing, and I
> do not know how to address this issue. Do you have any suggestions in this
> regard?
>
> I am attaching a pdf file and the script I wrote in case you would like to
> check it. Thanks in advance for your help,
>
> Best
>
> Christian.
>
> --
>
> PhD, Post-Doctoral Fellow
>
> Department of Religions and Philosophies
>
> Room 339
>
> SOAS University of London
> Thornhaugh Street
>
> London, WC1H 0XG
>
> [email protected]
>
>
>
>
