[ https://issues.apache.org/jira/browse/PDFBOX-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved PDFBOX-3742.
-------------------------------------
Resolution: Fixed
Assignee: Tilman Hausherr
Fix Version/s: 3.0.0, 2.0.6, 1.8.14
> Unknown dir object c='>' cInt=62 peek='>' peekInt=62
> ----------------------------------------------------
>
> Key: PDFBOX-3742
> URL: https://issues.apache.org/jira/browse/PDFBOX-3742
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.13, 2.0.5
> Environment: Based on Tika Docker image: logicalspark/docker-tikaserver
> Reporter: Igor Santos
> Assignee: Tilman Hausherr
> Fix For: 1.8.14, 2.0.6, 3.0.0
>
> Attachments: buggy.pdf, screenshot_002.png
>
>
> I originally hit this when running a 69-page PDF through Tika, and could
> isolate the issue to in between those two pages. Tika ends up responding
> with faulty XML, as the attached screenshot shows, and logs a stack trace
> that includes the PDFBox exception. The exception is shown below as
> reproduced with the standalone CLI tool.
> I'm using Tika 1.1.4, although I'm not exactly sure which version of PDFBox
> it uses. Here's the base
> [Dockerfile|https://github.com/LogicalSpark/docker-tikaserver/blob/master/Dockerfile].
> {code}
> $ java -jar pdfbox-app-2.0.5.jar ExtractText buggy.pdf
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSans' for 'ArialMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 150196
> Exception in thread "main" java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 150196
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:954)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:654)
> at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
> at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
> at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}
> Seems related to PDFBOX-1327.
--
This message was sent by Atlassian JIRA (v6.3.15#6346)