Re: Extracting Arabic text from a PDF

Pepijn Schmitz Sat, 13 Jun 2009 04:49:55 -0700

I haven't followed this thread, so I might be way off base, but I
noticed this:


Hesham G. wrote:
>             byte[] arabic_bytes = "بسم الله".getBytes();  // Arabic Text.
That will convert the characters to bytes according to the system
default encoding. Which encoding that is depends on the system you're
running on, but often it's some kind of 8-bit encoding which doesn't
contain Arabic characters. On a Western Windows machine it will probably
be Windows codepage 1252. Any characters which can't be represented in
the target encoding will be replaced with question marks. So the
question marks you're seeing were probably generated here already.

Anyway, this line and the next:
>             String arabic_text = new String( arabic_bytes, "UTF-8" );
...make no sense. You can't encode a String using one encoding (the
system default encoding, whichever that is) and then interpret the bytes
according to a different encoding. You'd end up with gibberish.
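
To illustrate, here is a minimal sketch of the difference. The class and
method names are made up for this example, and ISO-8859-1 merely stands in
for "some 8-bit default encoding without Arabic characters"; your actual
default may differ. StandardCharsets needs Java 7 or later; on older JVMs
you would pass the charset name ("UTF-8") to both calls instead. The
Arabic text is written with Unicode escapes so that the encoding of the
source file doesn't matter.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode and decode with the same, explicitly named charset:
    // the text survives the round trip unchanged.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode with an 8-bit charset that has no Arabic letters (a stand-in
    // for the system default), then decode as UTF-8. The Arabic letters
    // are already replaced by '?' bytes during the encoding step.
    static String encodeLatin1DecodeUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same text as in the original code, as Unicode escapes.
        String arabic = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647";
        System.out.println(roundTripUtf8(arabic).equals(arabic)); // true
        System.out.println(encodeLatin1DecodeUtf8(arabic));       // ??? ????
    }
}
```

In other words, you don't need the byte array at all here: the String
literal already holds the Arabic characters, provided the compiler read
the source file with the right encoding.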

Kind regards,
Pepijn Schmitz
>             PDDocument pdfFile = new PDDocument();
>             String filePath = "C:\\new_arabic.pdf";
>             PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile,
>                     new File("fonts/times.ttf") );  // Windows Arabic font
>             //font.setEncoding( new WinAnsiEncoding() );
>
>             PDPage page = new PDPage();
>             PDPageContentStream contentStream =
>                     new PDPageContentStream( pdfFile, page );
>
>             contentStream.beginText();
>             contentStream.setFont( font, 10 );
>             contentStream.moveTextPositionByAmount( 50, 720 );
>             contentStream.drawString( arabic_text );
>             contentStream.moveTextPositionByAmount( 0, -12 );
>             contentStream.endText();
>             contentStream.close();
>
>             pdfFile.addPage( page );
>             pdfFile.save( filePath );
>             pdfFile.close();
>         } catch (IOException e) {
>             e.printStackTrace();
>         } catch (COSVisitorException e) {
>             e.printStackTrace();
>         }
>
> I think the main problem is in writing that String into the PDF,
> which needs an Arabic font (which I already have) plus a suitable
> encoding to use when writing the text into the PDF, as we do when we
> write the String to a text file. Am I right?
> OutputStreamWriter fout = new OutputStreamWriter(
>         new FileOutputStream( textFile ), "UTF-8" );
>
> I checked the class WinAnsiEncoding ... I don't understand a lot about
> encoding, but is there a way to define the Bulgarian/Arabic characters
> inside that class using the method addCharacterEncoding(int code,
> COSName name); ?
>
>
> Thanks,
> Hesham
