[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-08 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228004#comment-17228004
 ] 

Tilman Hausherr commented on PDFBOX-5009:
-

I thought I was getting a stack overflow with PDFDebugger but no, this was 
probably because of some local changes.

Doing PDPage.get() on such files can throw an unchecked exception. Still bad, 
but not as bad as a stack overflow. So I have documented it. Preventing it 
seems tricky and would require an API change. That should be done in a separate 
issue.
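
Until such an API change exists, a caller could guard the lookup itself. The following is only a sketch of that idea: `GuardedLookup` and `tryGet` are hypothetical names, not PDFBox API, and because the comment above does not say which unchecked exception is thrown, the sketch catches `RuntimeException` broadly.

```java
import java.util.Optional;
import java.util.function.IntFunction;

// Hypothetical helper, not part of PDFBox.
final class GuardedLookup
{
    /**
     * Calls lookup.apply(index) and turns any unchecked exception into an
     * empty result. Errors such as StackOverflowError are not caught.
     */
    static <T> Optional<T> tryGet(IntFunction<T> lookup, int index)
    {
        try
        {
            return Optional.ofNullable(lookup.apply(index));
        }
        catch (RuntimeException e)
        {
            // The exact exception type is an unknown here, so catch broadly.
            return Optional.empty();
        }
    }
}
```

With PDFBox this might be used as `GuardedLookup.tryGet(i -> document.getPages().get(i), 0)`, assuming `document` is a loaded `PDDocument`.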

> Corrupt PDF can lead to a StackOverflow
> ---
>
> Key: PDFBOX-5009
> URL: https://issues.apache.org/jira/browse/PDFBOX-5009
> Project: PDFBox
>  Issue Type: Task
>  Components: Text extraction
>Affects Versions: 2.0.21
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0.22, 3.0.0 PDFBox
>
>
> See TIKA-3224.  I confirmed this with 2.0.21 by calling the app's ExtractText 
> on the file posted on the Tika issue.
> cc [~dadoonet]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227987#comment-17227987
 ] 

ASF subversion and git services commented on PDFBOX-5009:
-

Commit 1883201 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1883201 ]

PDFBOX-5009, PDFBOX-3953: improve javadoc




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227989#comment-17227989
 ] 

ASF subversion and git services commented on PDFBOX-5009:
-

Commit 1883202 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1883202 ]

PDFBOX-5009, PDFBOX-3953: improve javadoc




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226945#comment-17226945
 ] 

ASF subversion and git services commented on PDFBOX-5009:
-

Commit 1883149 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1883149 ]

PDFBOX-5009, PDFBOX-3953: prevent stack overflow with malformed PDFs




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-05 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226943#comment-17226943
 ] 

ASF subversion and git services commented on PDFBOX-5009:
-

Commit 1883148 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1883148 ]

PDFBOX-5009, PDFBOX-3953: prevent stack overflow with malformed PDFs




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226931#comment-17226931
 ] 

Tilman Hausherr commented on PDFBOX-5009:
-

OK, the reason for that one is that the code change only fixes the iterator. 
PDFDebugger doesn't use it (and gets another problem). I have displayed the 
stack overflow message and then ended the application.




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226911#comment-17226911
 ] 

Tilman Hausherr commented on PDFBOX-5009:
-

Thanks, I'll do that, and I'll also assign null to the set after construction 
to reduce memory usage.

For some reason, I can't display oleObject1_cleaned.pdf with PDFDebugger, but 
it must have worked at some time this morning because it is in the last files 
list. Weird...




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-05 Thread Andreas Lehmkühler (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1722#comment-1722
 ] 

Andreas Lehmkühler commented on PDFBOX-5009:


[~tilman] Looks good to me, just one small improvement for PDFs with a lot of 
pages. To minimize the number of elements in the set, it should be sufficient 
to store only the page tree nodes (the nodes that have kids), not the pages:
{code}
if (set.contains(kid))
{
    LOG.error("This node has already been visited");
    continue;
}
else if (kid.containsKey(COSName.KIDS))
{
    set.add(kid);
}
{code}
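
The effect of this suggestion can be sketched outside PDFBox. In the example below, `IntermediateOnlyWalk`, `Node`, and `Result` are hypothetical stand-ins rather than PDFBox classes, and the identity-based set is a choice made for the sketch. It remembers only nodes that have kids, so the set grows with the number of intermediate nodes and not with the number of pages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a page tree walk; none of these names are PDFBox API.
final class IntermediateOnlyWalk
{
    static final class Node
    {
        final String name;
        final List<Node> kids = new ArrayList<>();

        Node(String name)
        {
            this.name = name;
        }

        boolean hasKids()
        {
            return !kids.isEmpty();
        }
    }

    /** Leaves in visiting order, plus how many nodes had to be remembered. */
    static final class Result
    {
        final List<String> leaves = new ArrayList<>();
        int remembered;
    }

    static Result walk(Node root)
    {
        Result result = new Result();
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        enqueueKids(root, visited, result);
        result.remembered = visited.size();
        return result;
    }

    private static void enqueueKids(Node node, Set<Node> visited, Result result)
    {
        if (!node.hasKids())
        {
            result.leaves.add(node.name);
            return;
        }
        for (Node kid : node.kids)
        {
            if (visited.contains(kid))
            {
                continue; // already descended into this intermediate node
            }
            else if (kid.hasKids())
            {
                visited.add(kid); // only nodes that can recurse are stored
            }
            enqueueKids(kid, visited, result);
        }
    }
}
```

One consequence of this variant, at least in the sketch: a leaf that is referenced twice is no longer skipped, because leaves are never stored in the set.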





[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-04 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226501#comment-17226501
 ] 

Tilman Hausherr commented on PDFBOX-5009:
-

I'm able to catch this by using a set to prevent a recursive call with the same 
parameter:
{code:java}
private final class PageIterator implements Iterator<PDPage>
{
    private final Queue<COSDictionary> queue = new ArrayDeque<>();
    private Set<COSDictionary> set = new HashSet<>();

    private PageIterator(COSDictionary node)
    {
        enqueueKids(node);
    }

    private void enqueueKids(COSDictionary node)
    {
        if (isPageTreeNode(node))
        {
            List<COSDictionary> kids = getKids(node);
            for (COSDictionary kid : kids)
            {
                // ** NEW ** skip a kid that was already seen, so that a
                // cyclic kids reference cannot recurse without end
                if (set.contains(kid))
                {
                    LOG.error("This node has already been visited");
                    continue;
                }
                else
                {
                    set.add(kid);
                }

                enqueueKids(kid);
            }
        }
        else
        {
            queue.add(node);
        }
    }
 {code}
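
The same guard can be tried in isolation. The sketch below is not PDFBox code: `CycleSafeWalk` and `Node` are hypothetical stand-ins for the page tree, a node without kids is treated as a page, and the identity-based set is an assumption made for the sketch. It builds a tree whose intermediate node points back at the root and shows that the walk still terminates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for a page tree; none of these names are PDFBox API.
final class CycleSafeWalk
{
    /** A node with kids is an intermediate node; one without is a leaf (page). */
    static final class Node
    {
        final String name;
        final List<Node> kids = new ArrayList<>();

        Node(String name)
        {
            this.name = name;
        }
    }

    /** Collects leaf names in visiting order, skipping kids that were already seen. */
    static List<String> leaves(Node root)
    {
        Queue<Node> queue = new ArrayDeque<>();
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        enqueueKids(root, queue, visited);
        List<String> names = new ArrayList<>();
        for (Node leaf : queue)
        {
            names.add(leaf.name);
        }
        return names;
    }

    private static void enqueueKids(Node node, Queue<Node> queue, Set<Node> visited)
    {
        if (node.kids.isEmpty())
        {
            queue.add(node);
            return;
        }
        for (Node kid : node.kids)
        {
            // Set.add returns false when the element was already present.
            if (!visited.add(kid))
            {
                continue;
            }
            enqueueKids(kid, queue, visited);
        }
    }
}
```

As in the snippet above, only kids are added to the set, so a back reference to the root is entered once more before every kid is found in the set and skipped.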
 




[jira] [Commented] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-04 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226486#comment-17226486
 ] 

Tilman Hausherr commented on PDFBOX-5009:
-

I added some logging and stack tracing to see when it starts:
{noformat}
2020-11-05 05:19:14 WARN  PDPageTree:154 - i = 4, element is: COSObject{207, 0}
2020-11-05 05:19:14 WARN  PDPageTree:155 - COSDictionary expected, but got null
java.lang.Exception
    at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:157)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:41)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:184)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:173)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:167)
    at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:126)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:289)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:241)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:364)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:267)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:98)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:57)
2020-11-05 05:19:14 WARN  PDPageTree:154 - i = 5, element is: COSObject{214, 0}
2020-11-05 05:19:14 WARN  PDPageTree:155 - COSDictionary expected, but got null
java.lang.Exception
    at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:157)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:41)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:184)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:187)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:173)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:167)
    at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:126)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:289)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:241)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:364)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:267)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:98)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:57)
{noformat}
