[ 
https://issues.apache.org/jira/browse/TIKA-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137497#comment-15137497
 ] 

Pascal Essiembre commented on TIKA-741:
---------------------------------------

What? That easy? Those two simple lines did it in my local testing! :-)   Many 
thanks!  I'll upload the fix on our end when I get a chance.

As far as sharing goes, I am all for it, but our changes to the Tika parser 
classes for PDF are mainly to support PDFBox 2.0.0, which resolved several PDF 
issues reported by some of our users (such as better detection of spaces 
between terms in some PDFs).  Given that PDFBox 2.0.0 is not out yet, are you 
open to upgrading the Tika code base to support that version of PDFBox 
(replacing support for PDFBox 1.x)?

I think I added a few more things that may not be (or were not at the time) in 
the Tika parser, like extracting XFA text.  I can submit a patch for that as 
well if you are open to it.

Thanks again.

> "Zip bomb" (XML nesting) detection is too strict
> ------------------------------------------------
>
>                 Key: TIKA-741
>                 URL: https://issues.apache.org/jira/browse/TIKA-741
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Erik Hetzner
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.0
>
>
> I get "zip bomb" errors from many HTML documents, e.g. 
> http://www.akhbaar.org/wesima_articles/index-20100101-82736.html
> Is there a way that the element nesting level could be made configurable? 30 
> elements just doesn't seem to be enough.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
