[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763903#comment-17763903
 ] 

Tim Allison commented on PDFBOX-5682:
-------------------------------------

Both files spend quite a bit of time in "parseObjectDynamically" when I call 
this:

        PDDocument document = Loader.loadPDF(path.toFile());
        List<COSObject> objs =
                document.getDocument().getObjectsByType(COSName.FILESPEC);
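
To compare the two files outside of Tika, it may help to run the call above under a wall-clock limit and record how long it takes. The following is only a minimal sketch using the JDK's executor API; the names TimedCall, Outcome and run are hypothetical and not part of PDFBox or Tika, and the stand-in task in main would be replaced by the getObjectsByType call, with a limit such as the 3 minutes mentioned in the issue.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs one task with a wall-clock limit and reports the elapsed time.
public class TimedCall {

    /** Result of one timed call; value is null when the call timed out. */
    public static final class Outcome<T> {
        public final T value;
        public final boolean timedOut;
        public final long elapsedMillis;

        private Outcome(T value, boolean timedOut, long elapsedMillis) {
            this.value = value;
            this.timedOut = timedOut;
            this.elapsedMillis = elapsedMillis;
        }
    }

    public static <T> Outcome<T> run(Callable<T> task, long timeoutMillis) throws Exception {
        // Daemon thread, so a parse that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true);
            return t;
        });
        long start = System.nanoTime();
        Future<T> future = executor.submit(task);
        try {
            T value = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Outcome<>(value, false, elapsedMillis(start));
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; CPU-bound code may ignore the interrupt.
            future.cancel(true);
            return new Outcome<>(null, true, elapsedMillis(start));
        } catch (ExecutionException e) {
            // Rethrow the task's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    private static long elapsedMillis(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in task. With PDFBox on the classpath this would instead be:
        //   () -> document.getDocument().getObjectsByType(COSName.FILESPEC)
        Outcome<Integer> outcome = run(() -> 42, TimeUnit.MINUTES.toMillis(3));
        System.out.println("timedOut=" + outcome.timedOut
                + " elapsedMillis=" + outcome.elapsedMillis);
    }
}
```

Because cancellation relies on interruption, a parser stuck in a tight loop may keep running in the background after the timeout is reported; the daemon thread only ensures the JVM can still exit.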


> Long/permanent hang in PDFBox 3.x
> ---------------------------------
>
>                 Key: PDFBOX-5682
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5682
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> I found two files in the regression tests where we're now getting timeouts at 
> 3 minutes where we weren't before.  Unfortunately, PDFBox's export:text works 
> on both, so it is probably another structural feature, perhaps a problem in 
> Tika?
> This file halts after printing out the header for Table 19 on page 46: 
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.



