It won't work because PDFBox supports only a few encodings, and Arabic is not among them. We had to switch to JPod for this reason. It supports more encodings and is twice as fast.
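Separately from the library question, the re-encoding step in the quoted code can lose text on its own. A Java String is already Unicode, so pageData.getBytes() encodes it with the platform default charset, and any character that charset cannot represent is replaced before the Cp1256 decode ever runs. Below is a minimal sketch of that effect, not a fix for the extraction itself. The class and method names (EncodingSketch, toUtf8, lossyRoundTrip) are made up for illustration, the sample string stands in for whatever stripper.getText() returns, and US-ASCII stands in for a platform default that has no Arabic characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    /** Encodes already-decoded text as UTF-8 bytes, ready to be written to a file. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Mimics the shape of the quoted round trip: encode with one charset,
     * then decode the bytes with a different one.
     */
    static String lossyRoundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] raw = text.getBytes(encodeWith); // unmappable characters become '?'
        return new String(raw, decodeWith);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for stripper.getText(pdfFile): five Arabic letters.
        String pageData = "\u0645\u0631\u062D\u0628\u0627";

        // Direct UTF-8 encoding keeps every character.
        byte[] output = toUtf8(pageData);
        String decoded = new String(output, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(pageData)); // prints "true"

        // A default charset without Arabic (US-ASCII here, as an assumption)
        // destroys the text before any later decode can help.
        String broken = lossyRoundTrip(pageData,
                StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1);
        System.out.println(broken); // prints "?????"
    }
}
```

If the String coming out of the stripper already holds correct Arabic characters, writing toUtf8(pageData) straight to the FileOutputStream should be enough. If it does not, no re-encoding afterwards will recover them.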

----- Original Message -----
From: Hesham G. <heshamgne...@gmail.com>
To: pdfbox-users@incubator.apache.org <pdfbox-users@incubator.apache.org>
Sent: Thu Jun 04 09:24:07 2009
Subject: Extracting Arabic text from a PDF

Hello,

I'm trying to find a way to read some data from a PDF file using PDFBox and write it to a text file. For English PDFs the code below works perfect .... But what if the PDF contains other languages like Arabic - Chinese - ... I'm trying to figure out how to specify the Encoding.

Can you please tell me what's wrong with my code?

My Code :

/**
 * This method should read an Arabic PDF & write its contents in a text file.
 */
private void copyPDFText()
{
    PDDocument pdfFile;
    try
    {
        String file = "C:\\index.pdf";
        pdfFile = PDDocument.load( file ); // Open this pdf to edit.

        // Specify the page to read :
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage( 3 );
        stripper.setEndPage( 3 );

        // Read page data (Here's the problem - It reads it as String & not bytes as in text files):
        String pageData = stripper.getText( pdfFile );
        byte[] pageDataInBytes = pageData.getBytes();
        String decodedPageData = new String( pageDataInBytes, "Cp1256" );
        byte[] output = decodedPageData.getBytes( "UTF-8" );

        // Define the text file to write the data to & write the encoded output to it :
        File outfile = new File( "C:\\index.txt" );
        FileOutputStream fout = new FileOutputStream( outfile );
        fout.write( output );
        fout.close();
        System.out.println( "Read/Write complete" );

        pdfFile.save( file );
        pdfFile.close();
    }
    catch (IOException e)
    {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    catch (COSVisitorException e)
    {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Your help is really appreciated,
Hesham