[
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226666#comment-17226666
]
Andreas Lehmkühler commented on PDFBOX-5009:
--------------------------------------------
[~tilman] Looks good to me. Just one small improvement for PDFs with a lot of
pages: to minimize the number of elements in the set, it should be sufficient
to store only the page tree nodes, i.e. the entries that have a /Kids key:
{code}
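// Skip any node that was already seen, which breaks cycles in a corrupt tree.
if (set.contains(kid))
{
    LOG.error("This node has already been visited");
    continue;
}
else if (kid.containsKey(COSName.KIDS))
{
    // Remember only intermediate nodes; leaf pages are not stored.
    set.add(kid);
}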
{code}
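To illustrate the idea outside of PDFBox, here is a minimal, self-contained sketch. The names PageTreeWalk, Node and collectPages are hypothetical stand-ins, not PDFBox classes or the actual patch. It assumes that a cycle can only be closed through a node that has kids, so remembering those nodes should be enough to stop the recursion:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class PageTreeWalk {
    // Hypothetical stand-in for a page tree entry; not a PDFBox class.
    static final class Node {
        final String name;
        final List<Node> kids = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        boolean hasKids() {
            return !kids.isEmpty();
        }
    }

    // Returns the names of all leaf pages reachable from root.
    static List<String> collectPages(Node root) {
        List<String> pages = new ArrayList<>();
        // Identity-based set: two nodes are the same only if they are the same object.
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        visited.add(root);
        walk(root, visited, pages);
        return pages;
    }

    private static void walk(Node node, Set<Node> visited, List<String> pages) {
        for (Node kid : node.kids) {
            if (visited.contains(kid)) {
                // Already visited: following it again would recurse forever.
                continue;
            }
            if (kid.hasKids()) {
                // Only intermediate nodes go into the set, so its size grows
                // with the number of tree nodes, not the number of pages.
                visited.add(kid);
                walk(kid, visited, pages);
            } else {
                pages.add(kid.name);
            }
        }
    }
}
```

One consequence of this choice: a leaf page that is referenced from two places is not detected as a duplicate, because leaves are never stored in the set.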
> Corrupt PDF can lead to a StackOverflow
> ---------------------------------------
>
> Key: PDFBOX-5009
> URL: https://issues.apache.org/jira/browse/PDFBOX-5009
> Project: PDFBox
> Issue Type: Task
> Components: Text extraction
> Affects Versions: 2.0.21
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0.22, 3.0.0 PDFBox
>
>
> See TIKA-3224. I confirmed this with 2.0.21 by calling the app's ExtractText
> on the file posted on the Tika issue.
> cc [~dadoonet]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)