[jira] [Created] (PDFBOX-1039) Arabic Text Extraction using PDFTextStripper working partially

Franklin (JIRA) Fri, 17 Jun 2011 01:06:03 -0700

Arabic Text Extraction using PDFTextStripper working partially
--------------------------------------------------------------


                 Key: PDFBOX-1039
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1039
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.5.0
         Environment: Windows XP, Java 1.6
            Reporter: Franklin
         Attachments: TestPDFCreator.pdf, TestWord.pdf

I have been trying to extract the contents of a PDF file (so as to index it with 
Lucene). The PDF file contains Arabic text.

Both PDF files contain exactly the same information. The strange thing is that 
PDFTextStripper extracts the text from one file correctly (gives proper Arabic) 
but not from the other (gives only question marks ???? or [][][][][]).

Below is the code being used:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
 
public class TesExtraction {
 
        // Extract text from PDF Document
        static String pdftoText(String fileName) {
                PDFParser parser;
                String parsedText = null;
                PDFTextStripper pdfStripper = null;
                PDDocument pdDoc = null;
                COSDocument cosDoc = null;
                File file = new File(fileName);
                if (!file.isFile()) {
                        System.err.println("File " + fileName
                                        + " does not exist.");
                        return null;
                }
                try 
                {
                        parser = new PDFParser(new FileInputStream(file));
                } catch (IOException e) {
                        System.err.println("Unable to open PDF Parser. "
                                        + e.getMessage());
                        return null;
                }
                try 
                {
                        parser.parse();
                        cosDoc = parser.getDocument();
                        pdfStripper = new PDFTextStripper("CP-1252");
                        pdDoc = new PDDocument(cosDoc);
                        pdfStripper.setStartPage(1);
                        pdfStripper.setEndPage(5);
                        parsedText = pdfStripper.getText(pdDoc);
                } catch (Exception e) {
                        System.err.println("An exception occurred in parsing "
                                        + "the PDF Document. "
                                        + e.getMessage());
                } finally {
                        try {
                                if (cosDoc != null)
                                        cosDoc.close();
                                if (pdDoc != null)
                                        pdDoc.close();
                        } catch (Exception e) {
                                e.printStackTrace();
                        }
                }
                return parsedText;
        }
        public static void main(String args[])
        {
                
                System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestWord.pdf"));
                
                System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestPDFCreator.pdf"));
        }
 
}
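
One way to tell whether the question marks come from the extraction itself or 
only from the console's character set is to inspect the returned string as 
Unicode code points, and to write it to a UTF-8 file instead of System.out. 
The following is only a minimal sketch using the standard Java library; the 
class ExtractionCheck and its method names are hypothetical helpers, not part 
of PDFBox.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper class for inspecting extracted text; not part of PDFBox.
class ExtractionCheck {

    // Render each character as U+XXXX so the text can be inspected
    // without depending on the console's character set.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // True if at least one character lies in the Unicode Arabic block
    // (U+0600 to U+06FF).
    static boolean containsArabic(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.ARABIC) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    // Write the text to a file as UTF-8, bypassing the console encoding.
    static void writeUtf8(String path, String text) throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(path), "UTF-8");
        try {
            out.write(text);
        } finally {
            out.close();
        }
    }
}
```

For example, ExtractionCheck.toCodePoints(pdftoText(...)) could be printed for 
each of the two files. If containsArabic returns true for a file whose console 
output shows only question marks, the console encoding would be the likely 
cause; if the string itself holds U+003F characters, the replacement happened 
during extraction.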

NOTE: Where can I upload the PDF files?
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
