[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228004#comment-17228004 ]

Tilman Hausherr commented on PDFBOX-5009:

I thought I was getting a stack overflow with PDFDebugger, but no, this was probably because of some local changes. Doing PDPage.get() on such files can throw an unchecked exception. Still bad, but not as bad as a stack overflow. So I have documented it. Preventing it seems tricky and would require an API change. It should be done in a separate issue.

> Corrupt PDF can lead to a StackOverflow
> ---
>
> Key: PDFBOX-5009
> URL: https://issues.apache.org/jira/browse/PDFBOX-5009
> Project: PDFBox
> Issue Type: Task
> Components: Text extraction
> Affects Versions: 2.0.21
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 2.0.22, 3.0.0 PDFBox
>
> See TIKA-3224. I confirmed this with 2.0.21 by calling the app's ExtractText
> on the file posted on the Tika issue.
> cc [~dadoonet]

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227987#comment-17227987 ]

ASF subversion and git services commented on PDFBOX-5009:

Commit 1883201 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1883201 ]

PDFBOX-5009, PDFBOX-3953: improve javadoc
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227989#comment-17227989 ]

ASF subversion and git services commented on PDFBOX-5009:

Commit 1883202 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1883202 ]

PDFBOX-5009, PDFBOX-3953: improve javadoc
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226945#comment-17226945 ]

ASF subversion and git services commented on PDFBOX-5009:

Commit 1883149 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1883149 ]

PDFBOX-5009, PDFBOX-3953: prevent stack overflow with malformed PDFs
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226943#comment-17226943 ]

ASF subversion and git services commented on PDFBOX-5009:

Commit 1883148 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1883148 ]

PDFBOX-5009, PDFBOX-3953: prevent stack overflow with malformed PDFs
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226931#comment-17226931 ]

Tilman Hausherr commented on PDFBOX-5009:

OK, the reason for that one is that the code change only fixes the iterator. PDFDebugger doesn't use it (and gets another problem). I have displayed the stack overflow message and then ended the application.
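The "display the message, then end the application" approach can be illustrated in isolation. This is only a rough sketch and not PDFDebugger code: `OverflowGuard`, `recurseForever` and `runGuarded` are hypothetical names, and the sketch returns a message where a GUI tool would presumably show a dialog and exit.

```java
// Hypothetical illustration; not taken from PDFDebugger.
final class OverflowGuard {
    // Recurses without a base case in order to provoke a StackOverflowError.
    static int recurseForever(int depth) {
        return recurseForever(depth + 1) + 1;
    }

    // Runs the task and turns a stack overflow into a readable message.
    static String runGuarded(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (StackOverflowError e) {
            // The stack has unwound by the time control reaches this handler.
            return "stack overflow: the document structure is probably cyclic";
        }
    }
}
```

A caller could use `OverflowGuard.runGuarded(() -> OverflowGuard.recurseForever(0))` and report the returned text. Catching the error only limits the damage; it does not replace a guard against cycles in the traversal itself.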
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226911#comment-17226911 ]

Tilman Hausherr commented on PDFBOX-5009:

Thanks, I'll do that, also assign null to the set after construction to lessen memory usage. For some reason, I can't display oleObject1_cleaned.pdf with PDFDebugger, but it must have worked at some time this morning because it is in the last files list. Weird...
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1722#comment-1722 ]

Andreas Lehmkühler commented on PDFBOX-5009:

[~tilman] Looks good to me, just one small improvement for PDFs consisting of a lot of pages. To minimize the number of elements within the set, it should be sufficient to store only the page tree nodes:

{code}
if (set.contains(kid)) {
    LOG.error("This node has already been visited");
    continue;
} else if (kid.containsKey(COSName.KIDS)) {
    set.add(kid);
}
{code}
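To see what this refinement changes, here is a small stand-alone sketch. It is not PDFBox code: `TreeEntry`, `InteriorOnlyWalker` and the `hasKidsKey` flag are hypothetical stand-ins, with `hasKidsKey` playing the role of `kid.containsKey(COSName.KIDS)`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a page tree entry; not a PDFBox class.
final class TreeEntry {
    final List<TreeEntry> kids = new ArrayList<>();
    final boolean hasKidsKey; // true for intermediate nodes, false for pages

    TreeEntry(boolean hasKidsKey) {
        this.hasKidsKey = hasKidsKey;
    }
}

final class InteriorOnlyWalker {
    // Only intermediate nodes are remembered, so the set stays small.
    final Set<TreeEntry> visitedInterior = new HashSet<>();
    int leafCount = 0;

    void walk(TreeEntry node) {
        if (!node.hasKidsKey) {
            leafCount++; // a page: count it, nothing to recurse into
            return;
        }
        for (TreeEntry kid : node.kids) {
            if (visitedInterior.contains(kid)) {
                continue; // intermediate node already seen: skip it
            } else if (kid.hasKidsKey) {
                visitedInterior.add(kid);
            }
            walk(kid);
        }
    }
}
```

Since only nodes with kids can cause further recursion, tracking them alone is enough to stop the overflow. One consequence in this sketch is that a page referenced twice is counted twice, because pages are no longer in the set.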
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226501#comment-17226501 ]

Tilman Hausherr commented on PDFBOX-5009:

I'm able to catch this by using a set to prevent a recursive call with the same parameter:

{code:java}
private final class PageIterator implements Iterator<PDPage> {
    private final Queue<COSDictionary> queue = new ArrayDeque<>();
    private Set<COSDictionary> set = new HashSet<>();

    private PageIterator(COSDictionary node) {
        enqueueKids(node);
    }

    private void enqueueKids(COSDictionary node) {
        if (isPageTreeNode(node)) {
            List<COSDictionary> kids = getKids(node);
            for (COSDictionary kid : kids) {
                // ** NEW **
                if (set.contains(kid)) {
                    LOG.error("This node has already been visited");
                    continue;
                } else {
                    set.add(kid);
                }
                enqueueKids(kid);
            }
        } else {
            queue.add(node);
        }
    }
{code}
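The same guard can be tried outside PDFBox. The following is a minimal sketch under stated assumptions, not the library's implementation: `PageNode`, `PageWalker` and `collectLeaves` are hypothetical names, and the sketch uses an identity-based set (via `IdentityHashMap`) rather than the `HashSet` from the patch, so that two entries count as the same only when they are the same object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a page tree entry; not a PDFBox class.
final class PageNode {
    final String name;
    final List<PageNode> kids = new ArrayList<>();

    PageNode(String name) {
        this.name = name;
    }

    boolean isLeaf() {
        return kids.isEmpty();
    }
}

final class PageWalker {
    // Collects leaf names in traversal order, skipping any node reached twice.
    static List<String> collectLeaves(PageNode root) {
        List<String> leaves = new ArrayList<>();
        Set<PageNode> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        visited.add(root); // a back-reference to the root is then skipped too
        walk(root, visited, leaves);
        return leaves;
    }

    private static void walk(PageNode node, Set<PageNode> visited, List<String> leaves) {
        if (node.isLeaf()) {
            leaves.add(node.name);
            return;
        }
        for (PageNode kid : node.kids) {
            // Set.add returns false when the element was already present.
            if (!visited.add(kid)) {
                continue; // cycle or duplicate reference
            }
            walk(kid, visited, leaves);
        }
    }
}
```

With a tree in which an intermediate node lists the root among its kids, `collectLeaves` terminates and returns each leaf once, where an unguarded recursion would run until the stack is exhausted.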
[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow
[ https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226486#comment-17226486 ]

Tilman Hausherr commented on PDFBOX-5009:

I added some logging and stack tracing to see when it starts:

{noformat}
2020-11-05 05:19:14 WARN PDPageTree:154 - i = 4, element is: COSObject{207, 0}
2020-11-05 05:19:14 WARN PDPageTree:155 - COSDictionary expected, but got null
java.lang.Exception
    at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:157)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:41)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:184)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:173)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:167)
    at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:126)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:289)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:241)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:364)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:267)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:98)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:57)
2020-11-05 05:19:14 WARN PDPageTree:154 - i = 5, element is: COSObject{214, 0}
2020-11-05 05:19:14 WARN PDPageTree:155 - COSDictionary expected, but got null
java.lang.Exception
    at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:157)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:41)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:184)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:173)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:167)
    at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:126)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:289)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:241)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:364)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:267)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:98)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:57)
{noformat}