[ https://issues.apache.org/jira/browse/PDFBOX-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678608#comment-16678608 ]
ASF subversion and git services commented on PDFBOX-4367:
---------------------------------------------------------

Commit 1846064 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1846064 ]

PDFBOX-4367: run stripper by page as preparation to catch the exception in a later commit; improve usage text

> Error expected floating point number actual='18-5'
> --------------------------------------------------
>
> Key: PDFBOX-4367
> URL: https://issues.apache.org/jira/browse/PDFBOX-4367
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.12
> Environment: Mac OS X Sierra
> Reporter: Peter Johnson
> Priority: Minor
>
> Able to repeat with command line. Unfortunately, the only files that repeat
> this are from a customer, and contain sensitive information. The file opens
> without error in Acrobat Reader and Mac Preview. The desired result is that
> any corrupt portions of the PDF are skipped, so that we can use what text is
> extractable.
> Unfortunately, I still get an error when using the -force option.
> We get the following stack trace:
> {code:java}
> C02V390UHTD6:Downloads pjohnson$ java -jar pdfbox-app-2.0.12.jar ExtractText 16cccd9af5032a303774f7b87fb95076.pdf
> Nov 02, 2018 10:04:54 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 19727
> Exception in thread "main" java.io.IOException: Error expected floating point number actual='18-5'
>     at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
>     at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:110)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:947)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:631)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:174)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> Caused by: java.lang.NumberFormatException
>     at java.math.BigDecimal.<init>(BigDecimal.java:494)
>     at java.math.BigDecimal.<init>(BigDecimal.java:383)
>     at java.math.BigDecimal.<init>(BigDecimal.java:806)
>     at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
>     ... 14 more
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
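
For illustration only, here is a rough sketch of the behaviour the reporter asks for and that the commit message says it is preparing for: process one page at a time and skip a page whose content stream cannot be parsed. This is not the code from r1846064. `PageTextSource`, `ExtractionResult` and `PerPageExtraction` are hypothetical names made up for this sketch; in PDFBox 2.0.x the per-page call could presumably be backed by `PDFTextStripper` with `setStartPage(p)`, `setEndPage(p)` and `getText(document)`, but that wiring is an assumption and is left out so the example runs on the JDK alone.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for "extract the text of one page"; not a PDFBox type. */
interface PageTextSource {
    String textOfPage(int pageNumber) throws IOException; // 1-based page number
}

/** Holds the text that could be extracted and the pages that had to be skipped. */
final class ExtractionResult {
    final StringBuilder text = new StringBuilder();
    final List<Integer> failedPages = new ArrayList<>();
}

final class PerPageExtraction {

    /** Extracts page by page; a page that throws IOException is recorded and skipped. */
    static ExtractionResult extractSkippingBadPages(PageTextSource source, int pageCount) {
        ExtractionResult result = new ExtractionResult();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.text.append(source.textOfPage(page));
            } catch (IOException e) {
                // Assumption: a corrupt content stream surfaces as an IOException,
                // as it does in the stack trace quoted in this issue.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake source: page 2 fails with the message from the report.
        PageTextSource fake = page -> {
            if (page == 2) {
                throw new IOException("Error expected floating point number actual='18-5'");
            }
            return "text of page " + page + "\n";
        };
        ExtractionResult r = extractSkippingBadPages(fake, 3);
        System.out.print(r.text);
        System.out.println("skipped pages: " + r.failedPages);
    }
}
```

Collecting the failed page numbers instead of only logging them lets the caller decide whether a partial result is acceptable, which seems to match the "use what text is extractable" request.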