Arabic Text Extraction using PDFTextStripper working partially
--------------------------------------------------------------
Key: PDFBOX-1039
URL: https://issues.apache.org/jira/browse/PDFBOX-1039
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.5.0
Environment: Windows XP, Java 1.6
Reporter: Franklin
Attachments: TestPDFCreator.pdf, TestWord.pdf
I have been trying to extract the contents of PDF files (so as to index them
with Lucene). The PDF files contain Arabic text.
Both PDF files (TestWord.pdf and TestPDFCreator.pdf) contain exactly the same
information. The strange thing is that PDFTextStripper extracts the text from
one file correctly (giving proper Arabic) but not from the other (giving only
question marks ???? or boxes [][][][][]).
Below is the code being used:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class TesExtraction {

    // Extract the text of pages 1 to 5 from a PDF document.
    // Returns null if the file is missing or cannot be parsed.
    static String pdftoText(String fileName) {
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;

        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }

        try {
            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
            return null;
        }

        try {
            parser.parse();
            cosDoc = parser.getDocument();
            // The stripper is created with the CP-1252 encoding.
            pdfStripper = new PDFTextStripper("CP-1252");
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.err.println("An exception occurred in parsing the PDF Document. "
                    + e.getMessage());
        } finally {
            try {
                if (cosDoc != null)
                    cosDoc.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }

    public static void main(String args[]) {
        System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestWord.pdf"));
        System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestPDFCreator.pdf"));
    }
}
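A small check that may help narrow this down: the sketch below uses only the JDK (no PDFBox) and is not part of the code above. The class name EncodingCheck and the helpers roundTrip and describe are hypothetical names chosen for illustration. It shows that Arabic characters cannot survive a pass through CP-1252 (each one becomes "?"), and it prints the code points of a string so that real U+003F question marks can be told apart from Arabic characters that the console merely fails to display. It does not by itself explain why the two files behave differently.

```java
import java.nio.charset.Charset;

public class EncodingCheck {

    // Encode the text to bytes in the given charset and decode it again.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement bytes, which is '?' for windows-1252.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // List the Unicode code points of the text, e.g. "U+0645 U+003F".
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file
        // encoding does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        Charset cp1252 = Charset.forName("windows-1252");
        Charset utf8 = Charset.forName("UTF-8");

        // Arabic is not representable in CP-1252, so every letter is lost.
        System.out.println(describe(roundTrip(arabic, cp1252)));
        // UTF-8 preserves the original code points.
        System.out.println(describe(roundTrip(arabic, utf8)));
    }
}
```

Passing the string returned by pdftoText to describe would show whether the question marks are already present in the extracted text or are only introduced when System.out prints to a console that uses a Windows code page.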
NOTE: Where can I upload the PDF files?