[ https://issues.apache.org/jira/browse/PDFBOX-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-3955.
----------------------------------------
Resolution: Fixed
I've fixed the very slow performance: object streams were being parsed
multiple times while the trailer dictionary was rebuilt. However, my fix
doesn't "heal" the truncated PDF. The file is corrupt and can't be repaired
because the root object is missing.
[[email protected]] Thanks for the finding.
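
To illustrate the kind of change described above, here is a minimal sketch of
parsing each object stream at most once by caching the result. This is not the
actual PDFBox patch: the class ObjectStreamCache, its methods, and the
string-based "parsed objects" are hypothetical stand-ins chosen only to show
the idea.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch (not PDFBox code): remembers the parsed content of each
 * object stream so that repeated lookups do not trigger a re-parse.
 */
class ObjectStreamCache {
    // Parsed objects keyed by the object number of the containing stream.
    private final Map<Long, List<String>> parsed = new HashMap<>();
    // Stand-in for the expensive parsing step; supplied by the caller.
    private final Function<Long, List<String>> parser;
    // Counts how often the expensive parser was actually invoked.
    private int parseCount = 0;

    ObjectStreamCache(Function<Long, List<String>> parser) {
        this.parser = parser;
    }

    /** Returns the objects of the given stream, parsing it only on first use. */
    List<String> objectsOf(long streamObjectNumber) {
        List<String> cached = parsed.get(streamObjectNumber);
        if (cached == null) {
            parseCount++;
            cached = parser.apply(streamObjectNumber);
            parsed.put(streamObjectNumber, cached);
        }
        return cached;
    }

    int parseCount() {
        return parseCount;
    }
}
```

With such a cache, a trailer rebuild that looks into the same object stream
many times pays the parsing cost once per stream rather than once per lookup.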
> new -- very slow processing on truncated PDF
> --------------------------------------------
>
> Key: PDFBOX-3955
> URL: https://issues.apache.org/jira/browse/PDFBOX-3955
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Reporter: Tim Allison
> Assignee: Andreas Lehmkühler
> Fix For: 2.0.8, 3.0.0
>
>
> In the latest regression run with PDFBox's 2.x branch, we're now getting very
> slow processing on a truncated PDF with PDFBox app's {{ExtractText}}:
> http://162.242.228.174/docs/truncated_pdfs/commoncrawl2_likely_broken/7K/7KK53NK5PVKOUGDSQ4FK6542BNPC4SWB
> Turns out this is not an infinite loop. After 4.5 minutes, {{ExtractText}}
> eventually ended with:
> {noformat}
> Exception in thread "main" java.io.IOException: Missing root object specification in trailer.
>     at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2508)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1012)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:950)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:192)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {noformat}