Re: Extracting Arabic text from a PDF

Hesham G. Sat, 13 Jun 2009 00:56:35 -0700

Thanks Andreas ...

I first agreed with you that there's no need to redefine the encoding like this:

String yourString = new String("pasted text from word","UTF-8");

A String already contains Unicode, so Bulgarian or Arabic text surely appears okay inside that String. Am I right?

But anyway, I tried what you asked me and it produced question marks "????????" in the PDF:

    try {
        byte[] arabic_bytes = "بسم الله".getBytes();  // Arabic text.
        String arabic_text = new String( arabic_bytes, "UTF-8" );
        PDDocument pdfFile = new PDDocument();

        String filePath = "C:\\new_arabic.pdf";
        PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File("fonts/times.ttf") ); // Windows Arabic font

        //font.setEncoding( new WinAnsiEncoding() );

        PDPage page = new PDPage();

        PDPageContentStream contentStream = new PDPageContentStream( pdfFile, page );

        contentStream.beginText();
        contentStream.setFont( font, 10 );
        contentStream.moveTextPositionByAmount( 50, 720 );
        contentStream.drawString( arabic_text );
        contentStream.moveTextPositionByAmount( 0, -12 );
        contentStream.endText();
        contentStream.close();

        pdfFile.addPage( page );
        pdfFile.save( filePath );
        pdfFile.close();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (COSVisitorException e) {
        e.printStackTrace();
    }
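
To separate the String question from the PDF question, here is a minimal sketch (plain Java, no PDFBox) of what the getBytes() / new String(bytes, "UTF-8") round trip in the snippet above can do to the text. The class name CharsetRoundTrip and the method roundTrip are made up for illustration, and StandardCharsets needs Java 7 or later. ISO-8859-1 is only a stand-in for a platform default charset that has no Arabic letters; the actual default on a given machine may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode the text with one charset and decode the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // The same Arabic text as above, written with Unicode escapes so the
        // result does not depend on the encoding of the source file.
        String arabic = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647";

        // Same charset in both directions: the String comes back unchanged.
        String same = roundTrip(arabic, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(arabic));

        // A charset without Arabic letters replaces each of them with '?'
        // while encoding, so the damage is done before decoding starts.
        String lossy = roundTrip(arabic, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(lossy);
    }
}
```

With the no-argument getBytes(), the encoding step uses the platform default charset, so the outcome depends on the machine. Whether this round trip is what causes the question marks in the PDF is not settled by this sketch; the font and encoding handling at drawing time may matter as well.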

I think the main problem is in writing that String into the PDF, which needs an Arabic font (which I already supplied) plus a suitable encoding to use when writing the text into the PDF, as we do when we write a String to a text file. Am I right?

OutputStreamWriter fout = new OutputStreamWriter( new FileOutputStream( textFile ), "UTF-8" );
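
For comparison, the text-file case can be sketched as below: the encoding is named once when the writer is created and again when the file is read back. This only illustrates the analogy with plain text files and says nothing about how PDFBox encodes text. The class name Utf8TextFile and its methods are hypothetical, and the try-with-resources and java.nio.file calls assume Java 7 or later.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8TextFile {
    // Write the text as UTF-8, naming the charset explicitly.
    static void write(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding with the same charset.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String arabic = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647";
        Path file = Files.createTempFile("arabic", ".txt");
        write(file, arabic);
        // The text survives because writer and reader agree on the charset.
        System.out.println(read(file).equals(arabic));
        Files.delete(file);
    }
}
```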

I checked the class WinAnsiEncoding ... I don't understand a lot about encoding, but is there a way to define the Bulgarian/Arabic characters inside that class using the method addCharacterEncoding(int code, COSName name)?



Thanks,

Hesham
