Re: Re: Extracting Arabic text from a PDF

Andreas Lehmkühler Sun, 07 Jun 2009 23:53:21 -0700

Hi,
it's true that there are still a few issues when extracting text in non-Latin
character sets, but the current version contains a lot of improvements, and I
suspect the problem lies in your implementation.
You don't have to convert the extracted text from one encoding to another
yourself; PDFBox does all of that for you. Have a look at the command-line tool
org.apache.pdfbox.ExtractText as an example implementation.
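
To illustrate why the manual conversion does harm, here is a minimal sketch
that uses only the JDK, not PDFBox. The class and method names are made up for
the example, the hard-coded string stands in for whatever stripper.getText()
returns, and ISO-8859-1 stands in for a platform default charset that cannot
represent Arabic. A Java String is already Unicode, so the only place an
encoding needs to be chosen is where the text is written out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PDFBox.
class EncodingRoundTrip {

    // Mimics the pattern from the question: turn the String into bytes with
    // one charset, then decode those bytes again with another charset.
    static String reinterpret(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    // Writes the String through a Writer with an explicit charset and
    // returns the bytes that would end up in the output file.
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Five Arabic letters, standing in for the extracted page text.
        String extracted = "\u0645\u0631\u062D\u0628\u0627";

        // A charset without Arabic replaces every letter with '?'.
        String lossy = reinterpret(extracted,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(extracted.equals(lossy));    // false

        // Writing with an explicit UTF-8 Writer keeps the text intact.
        String restored = new String(writeAsUtf8(extracted), StandardCharsets.UTF_8);
        System.out.println(extracted.equals(restored)); // true
    }
}
```

In the code from the question, the same idea would mean dropping the two
getBytes()/new String() conversions and wrapping the FileOutputStream in an
OutputStreamWriter with "UTF-8". As far as I know, PDFTextStripper also offers
writeText( document, writer ), which writes straight to such a Writer.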


Andreas Lehmkühler

----- Original Message -----

Subject: Re: Extracting Arabic text from a PDF
Sent: Sat, 06 Jun 2009
Von: Dmitry Goldenberg<dgoldenb...@attivio.com>

> It won't work because PDFBox only supports a few encodings, which do not
> include Arabic. We had to switch to JPod because of this. It supports more
> encodings and is twice as fast.
> 
> ----- Original Message -----
> From: Hesham G. <heshamgne...@gmail.com>
> To: pdfbox-users@incubator.apache.org <pdfbox-users@incubator.apache.org>
> Sent: Thu Jun 04 09:24:07 2009
> Subject: Extracting Arabic text from a PDF
> 
> Hello,
>  
> I'm trying to find a way to read some data from a PDF file using PDFBox and
> write it to a text file.
> For English PDFs the code below works perfectly, but what if the PDF
> contains other languages such as Arabic, Chinese, etc.?
> I'm trying to figure out how to specify the encoding.
> Can you please tell me what's wrong with my code?
> 
> My code:
> 
>     /**
>      * This method should read an Arabic PDF & write its contents in a text file.
>      */
>     private void copyPDFText() {
>         PDDocument pdfFile;
>         try {
>             String file = "C:\\index.pdf";
>             pdfFile = PDDocument.load( file ); // Open this pdf to edit.
>
>             // Specify the page to read :
>             PDFTextStripper stripper = new PDFTextStripper();
>             stripper.setStartPage( 3 );
>             stripper.setEndPage( 3 );
>
>             // Read page data (Here's the problem - It reads it as String & not bytes as in text files):
>             String pageData = stripper.getText( pdfFile );
>             byte[] pageDataInBytes = pageData.getBytes();
>
>             String decodedPageData = new String( pageDataInBytes, "Cp1256" );
>             byte[] output = decodedPageData.getBytes( "UTF-8" );
>
>             // Define the text file to write the data to & write the encoded output to it :
>             File outfile = new File( "C:\\index.txt" );
>             FileOutputStream fout = new FileOutputStream( outfile );
>             fout.write( output );
>             fout.close();
>
>             System.out.println( "Read/Write complete" );
>
>             pdfFile.save( file );
>             pdfFile.close();
>         } catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         } catch (COSVisitorException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>     }
> 
> 
> 
> Your help is really appreciated,
> Hesham

----- End of original message -----
