[ https://issues.apache.org/jira/browse/PDFBOX-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-3955.
----------------------------------------
    Resolution: Fixed

I've fixed the very slow performance: object streams were parsed multiple 
times when rebuilding the trailer dictionary. However, my fix doesn't "heal" 
the truncated PDF. The file is corrupt and can't be repaired because the root 
object is missing.
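
To illustrate the kind of change described above, here is a minimal sketch of 
parsing each object stream at most once by caching the result. It is not the 
actual PDFBox patch; the names ObjectStreamCache, objectsIn and parseCount are 
hypothetical, and the "parser" is a stand-in function rather than real PDF 
parsing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical cache (not PDFBox API): parses each object stream at most once. */
final class ObjectStreamCache {
    // Parsed contents, keyed by the object number of the object stream.
    private final Map<Long, List<String>> parsed = new HashMap<>();
    // Stand-in for the expensive parsing step.
    private final LongFunction<List<String>> parser;
    // Counts how often the expensive parser actually ran.
    private int parseCount = 0;

    ObjectStreamCache(LongFunction<List<String>> parser) {
        this.parser = parser;
    }

    /** Returns the objects of the given stream, parsing only on first access. */
    List<String> objectsIn(long streamObjectNumber) {
        List<String> cached = parsed.get(streamObjectNumber);
        if (cached == null) {
            cached = parser.apply(streamObjectNumber);
            parseCount++;
            parsed.put(streamObjectNumber, cached);
        }
        return cached;
    }

    int parseCount() {
        return parseCount;
    }
}

final class ObjectStreamCacheDemo {
    public static void main(String[] args) {
        // Assumption: a fake parser that just fabricates two entries per stream.
        ObjectStreamCache cache = new ObjectStreamCache(
                n -> Arrays.asList("obj " + n + "-a", "obj " + n + "-b"));

        // Simulate a trailer rebuild that asks for the same streams repeatedly.
        for (int i = 0; i < 1000; i++) {
            cache.objectsIn(7L);
            cache.objectsIn(9L);
        }
        System.out.println("parses performed: " + cache.parseCount()); // prints 2
    }
}
```

Without the cache, the loop above would trigger 2000 parses instead of 2, 
which is the sort of repeated work that can turn a failing parse into a 
multi-minute one.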

[[email protected]] Thanks for the finding.

> new -- very slow processing on truncated PDF
> --------------------------------------------
>
>                 Key: PDFBOX-3955
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3955
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>             Fix For: 2.0.8, 3.0.0
>
>
> In the latest regression run with PDFBox's 2.x branch, we're now getting very 
> slow processing on a truncated PDF with PDFBox app's {{ExtractText}}:
> http://162.242.228.174/docs/truncated_pdfs/commoncrawl2_likely_broken/7K/7KK53NK5PVKOUGDSQ4FK6542BNPC4SWB
> Turns out this is not an infinite loop.  After 4.5 minutes, {{ExtractText}} 
> eventually ended with: 
> {noformat}
> Exception in thread "main" java.io.IOException: Missing root object specification in trailer.
>         at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2508)
>         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1012)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:950)
>         at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:192)
>         at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {noformat}



