Eric R Manzitti created PDFBOX-5290:
---------------------------------------
Summary: ClassCastException during Text Extraction
Key: PDFBOX-5290
URL: https://issues.apache.org/jira/browse/PDFBOX-5290
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.24, 2.0.20
Reporter: Eric R Manzitti
Attachments: newBroke.pdf
I am getting the following exception:
java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSArray
when executing the following code:
public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
    String UTF_8 = "UTF-8";
    PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
    String regex =
        pdfLibraryProperties.getAsString(PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
    byte[] bytesToReturn;
    try {
        FileInputStream fis = new FileInputStream(new File(fileNamePath));
        PDDocument pdfDoc = PDDocument.load(fis);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String textFromPDF = pdfStripper.getText(pdfDoc);
        pdfDoc.close();
        bytesToReturn = textFromPDF.getBytes(UTF_8);
        String textStr = new String(bytesToReturn).replaceAll(regex,
            PDFLibraryConstants.BLANK_SPACE);
        bytesToReturn = textStr.getBytes();
        fis.close();
    } catch (IOException e) {
        pqUtilityLogger.logError(e.getMessage());
        throw new PQException(e.getMessage());
    }
    return bytesToReturn;
}
The exception is thrown on this line: String textFromPDF = pdfStripper.getText(pdfDoc);
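
The post-processing after getText() is never reached when the exception occurs, so it is probably unrelated to the failure. For reference, here is a minimal sketch of that step in isolation, using only the JDK. It applies the regex to the String directly and encodes once with an explicit charset, whereas the method above decodes and re-encodes with the platform default charset. The class name TextCleanup, the method clean, and the sample regex are hypothetical stand-ins; the real regex comes from PDFLibraryProperties, which is not shown in this report.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or of the reported application.
class TextCleanup {

    // Removes everything matching the regex from the extracted text,
    // then encodes the result as UTF-8 in a single step.
    static byte[] clean(String extractedText, String regex, String replacement) {
        String cleaned = extractedText.replaceAll(regex, replacement);
        return cleaned.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample input and regex are assumptions chosen for illustration only.
        byte[] out = clean("Page 1\f\fPage 2", "\\f+", " ");
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

Running this step on a fixed string, without any PDF involved, may help confirm that the ClassCastException originates inside the text stripper and not in the cleanup code.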