[ 
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522515#comment-17522515
 ] 

David Avant commented on PDFBOX-5415:
-------------------------------------

To summarize the verdict from Michael Demey: this PDF is wack. :)
The loop is not infinite, but its running time grows exponentially, to the 
point that our star may go supernova before Tika finishes parsing the file.

As ridiculous as this PDF might be, I suspect we need to defend against it. 
Otherwise a file like this becomes a potential denial-of-service vector.

Given Michael's description of the nature of the issue, does it seem plausible 
that this can be fixed within the parser itself?    Or do we need some external 
means of defense, like insulating the rest of the application by running Tika 
within its own thread?
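
To make the "external defense" option concrete, here is a minimal sketch of 
time-boxing a parse on its own thread. It is not Tika or PDFBox code: the 
class name TimeBoxedParse, the method runWithTimeout, and the sleeping task 
are hypothetical stand-ins for the real parse call, and only java.util.concurrent 
is used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBoxedParse {

    /**
     * Runs the task on a separate daemon thread and gives up after the timeout.
     * Returns null when the task did not finish in time.
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only requests interruption; code that never checks
            // the interrupt flag will keep running in the background.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a parse that effectively never finishes.
        Callable<String> slowParse = () -> {
            Thread.sleep(60_000);
            return "extracted text";
        };
        String result = runWithTimeout(slowParse, 200, TimeUnit.MILLISECONDS);
        System.out.println(result == null ? "timed out" : result);
    }
}
```

The caveat is that a thread can only be asked to stop. If the parsing loop 
never checks for interruption, the caller gets control back after the timeout 
but the worker keeps burning CPU, so a separate process that can be killed is 
the sturdier form of isolation.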

> Infinite loop in ExtractText in 2.x branch on a specific pdf
> ------------------------------------------------------------
>
>                 Key: PDFBOX-5415
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5415
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.26
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: PDFBOX-5415-TIKA-3718-p10.pdf
>
>
> [~DavidAvant] reported an infinite loop in Tika and provided an example file. 
>  I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's 
> ExtractText.
> File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf
> Adobe and a slightly out of date pdftotext also have problems with this file.



