Re: Extracting Arabic text from a PDF

Andreas Lehmkühler Mon, 08 Jun 2009 23:24:48 -0700

Hi Hesham,

> Hello, thanks for your reply to my question:
> http://markmail.org/thread/r2bbln57hjng7bxu
> Sorry for replying to you through mail ... I've already sent a reply but it
> didn't appear in the thread!!
Did you get an error message or something like that?
Either way, I'll cc my answer to the list.


> What I was trying to say is that I've already seen the ExtractText.java file
> and I implemented this solution, which worked fine.
> Now I am trying to write the extracted data from the PDF to a PDF file
> instead of a text file. If the extracted data is in English,
> it's written okay, but what if the extracted data is in Arabic, for
> example? The written data in the PDF will appear
> like this: þÿþÞ jþäþË
> I've changed the encoding for the written data in the PDF like this:
> PDSimpleFont font = PDType1Font.TIMES_ROMAN;
> font.setEncoding( new WinAnsiEncoding() );
> This encoding works fine for German, French, Turkish ... but not Arabic. Any
> ideas? Is there a way to use UTF-8 encoding? I've attached my code to
> this e-mail.
> I hope you can help me. Thanks, Hesham
The "trick" is to use the correct font. For Arabic text you'll of course need
a font which supports Arabic characters. Have a look at
org.apache.pdfbox.TextToPDF as a simple example of how to use external fonts for
drawing.
If you want to use UTF-8 encoding, just do it. Create a string with the wanted
encoding and add it to the PDF by calling the method drawString from
PDPageContentStream (see org.apache.pdfbox.TextToPDF).
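
To illustrate why changing the encoding alone cannot help, here is a minimal sketch that uses only the JDK, not PDFBox. It assumes the Java charset windows-1252 is a reasonable stand-in for WinAnsiEncoding, and the class and method names (EncodingCheck, fitsWinAnsi, misreadUtf16) are made up for this example. It checks whether a string fits into that single-byte encoding, and shows that UTF-16 bytes read back as single-byte text start with the same "þÿ" seen in the garbled output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Assumption: windows-1252 stands in for PDFBox's WinAnsiEncoding here.
    static final Charset WIN_ANSI = Charset.forName("windows-1252");

    /** Returns true if every character of the text has a code in the single-byte charset. */
    static boolean fitsWinAnsi(String text) {
        return WIN_ANSI.newEncoder().canEncode(text);
    }

    /** Encodes the text as UTF-16 (big-endian with BOM) and reads the bytes back as single-byte text. */
    static String misreadUtf16(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16, WIN_ANSI);
    }

    public static void main(String[] args) {
        String german = "Gr\u00FC\u00DFe";                 // "Gruesse" written with u-umlaut and sharp s
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // an Arabic word, written with escapes

        System.out.println("German fits: " + fitsWinAnsi(german));
        System.out.println("Arabic fits: " + fitsWinAnsi(arabic));

        // U+FEDE is an Arabic presentation form; its UTF-16 bytes are FE FF FE DE.
        System.out.println("Misread: " + misreadUtf16("\uFEDE"));
    }
}
```

The German string fits and the Arabic one does not, so no single-byte encoding setting on a Type 1 font will produce Arabic glyphs. That is why the font itself has to change.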

Andreas Lehmkühler
