[
https://issues.apache.org/jira/browse/PDFBOX-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-1039.
----------------------------------------
Resolution: Not A Problem
Assignee: Andreas Lehmkühler
Everything works as expected from the PDFBox point of view.
I'm afraid the text can't be extracted from this kind of PDF. The font uses a
built-in encoding that simply numbers the characters from 0 to 5, without any
mapping to readable characters.
Try extracting the text with Acrobat Reader (mark the text, then copy and paste
it) and you'll get the same result.
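To illustrate why the result is unreadable, here is a minimal sketch in plain
Java. It is not PDFBox code: the class BuiltinEncodingDemo, the decode method
and both lookup tables are made up for this example, and the Arabic letters are
arbitrary. It only mimics the general idea that an extractor needs a table from
character codes to Unicode, and has nothing sensible to emit when a font
supplies codes 0 to 5 with no such table.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only, NOT PDFBox code: mimics how a text extractor turns
// character codes into Unicode using a per-font lookup table.
class BuiltinEncodingDemo {

    // Placeholder emitted when a character code has no Unicode mapping.
    static final char UNMAPPED = '?';

    // Translates each code through the table; codes without an entry
    // become the UNMAPPED placeholder.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            if (text != null) {
                sb.append(text);
            } else {
                sb.append(UNMAPPED);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The same six character codes appear in both hypothetical files.
        int[] codes = {0, 1, 2, 3, 4, 5};

        // Hypothetical font A: ships a code-to-Unicode table.
        Map<Integer, String> withMapping = new HashMap<Integer, String>();
        withMapping.put(0, "\u0627"); // ARABIC LETTER ALEF
        withMapping.put(1, "\u0628"); // ARABIC LETTER BEH
        withMapping.put(2, "\u062A"); // ARABIC LETTER TEH
        withMapping.put(3, "\u062B"); // ARABIC LETTER THEH
        withMapping.put(4, "\u062C"); // ARABIC LETTER JEEM
        withMapping.put(5, "\u062D"); // ARABIC LETTER HAH

        // Hypothetical font B: built-in encoding only, no table at all.
        Map<Integer, String> withoutMapping = new HashMap<Integer, String>();

        System.out.println(decode(codes, withMapping));    // six Arabic letters
        System.out.println(decode(codes, withoutMapping)); // six placeholders
    }
}
```

Both files would render identically on screen, because rendering only needs the
glyph shapes. Extraction needs the table as well, which is why only one of the
two files yields proper Arabic.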
> Arabic Text Extraction using PDFTextStripper working partially
> --------------------------------------------------------------
>
> Key: PDFBOX-1039
> URL: https://issues.apache.org/jira/browse/PDFBOX-1039
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.5.0
> Environment: Windows XP, Java 1.6
> Reporter: Franklin
> Assignee: Andreas Lehmkühler
> Labels: arabic, textExtraction
> Attachments: TestPDFCreator.pdf, TestWord.pdf
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> I have been trying to extract the contents of a PDF file (in order to index
> it with Lucene). The PDF file contains Arabic text.
> Both PDF files contain exactly the same information. The strange thing is that
> PDFTextStripper extracts the text from one file correctly (proper Arabic) but
> not from the other (only question marks ???? or [][][][][]).
> Below is the code being used:
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
>
> public class TesExtraction {
>
>     // Extract text from PDF Document
>     static String pdftoText(String fileName) {
>         PDFParser parser;
>         String parsedText = null;
>         PDFTextStripper pdfStripper = null;
>         PDDocument pdDoc = null;
>         COSDocument cosDoc = null;
>         File file = new File(fileName);
>         if (!file.isFile()) {
>             System.err.println("File " + fileName + " does not exist.");
>             return null;
>         }
>         try {
>             parser = new PDFParser(new FileInputStream(file));
>         } catch (IOException e) {
>             System.err.println("Unable to open PDF Parser. " + e.getMessage());
>             return null;
>         }
>         try {
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdfStripper = new PDFTextStripper("CP-1252");
>             pdDoc = new PDDocument(cosDoc);
>             // Only the first five pages are extracted
>             pdfStripper.setStartPage(1);
>             pdfStripper.setEndPage(5);
>             parsedText = pdfStripper.getText(pdDoc);
>         } catch (Exception e) {
>             System.err.println("An exception occurred in parsing the PDF Document."
>                     + e.getMessage());
>         } finally {
>             try {
>                 if (cosDoc != null)
>                     cosDoc.close();
>                 if (pdDoc != null)
>                     pdDoc.close();
>             } catch (Exception e) {
>                 e.printStackTrace();
>             }
>         }
>         return parsedText;
>     }
>
>     public static void main(String args[]) {
>         System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestWord.pdf"));
>         System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestPDFCreator.pdf"));
>     }
>
> }
> NOTE: Where can I upload the PDF files?
>