[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764356#comment-17764356
 ] 

Andreas Lehmkühler commented on PDFBOX-5682:
--------------------------------------------

[~tallison] Thanks for the explanation. That is suboptimal ... in the end one 
has to dereference all indirect objects to collect all possible occurrences. 
For example, the first PDF mentioned contains 100k indirect objects, and it 
took some time to dereference them all. I'll see if there is any chance to 
optimize the process.

> Long/permanent hang in PDFBox 3.x
> ---------------------------------
>
>                 Key: PDFBOX-5682
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5682
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> I found two files in the regression tests where we're now getting timeouts at 
> 3 minutes where we weren't before.  Unfortunately, PDFBox's export:text works 
> on both, so it is probably another structural feature, perhaps a problem in 
> Tika?
> This file halts after printing out the header for Table 19 on page 46: 
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

