[ https://issues.apache.org/jira/browse/TIKA-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137497#comment-15137497 ]
Pascal Essiembre commented on TIKA-741:
---------------------------------------
What? That easy? Those two simple lines did it in my local testing! :-) Many
thanks! I'll upload the fix on our end when I get a chance.
As far as sharing goes, I am all for it, but our changes to the Tika PDF parser
classes are mainly to support PDFBox 2.0.0, which resolved several PDF issues
reported by some of our users (such as better detection of spaces between terms
in some PDFs). Given that PDFBox 2.0.0 is not out yet, are you open to upgrading
the Tika code base to support that version of PDFBox (replacing support for
PDFBox 1.x)? I think I also added a few things that may not be (or were not at
the time) in the Tika parser, like extracting XFA text. I can submit a patch for
that as well if you are open to it.
Thanks again.
> "Zip bomb" (XML nesting) detection is too strict
> ------------------------------------------------
>
> Key: TIKA-741
> URL: https://issues.apache.org/jira/browse/TIKA-741
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Erik Hetzner
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 1.0
>
>
> I get "zip bomb" errors from many HTML documents, e.g.
> http://www.akhbaar.org/wesima_articles/index-20100101-82736.html
> Is there a way that the element nesting level could be made configurable? 30
> elements just doesn't seem to be enough.
> Thanks!