Tim Allison edited comment on TIKA-2091 at 9/22/16 7:07 PM:

This particular exception is caused by Solr's {{MostlyPassthroughHtmlMapper}}.  

The triggering document has a run of ~50 <div> starts and then ~50+ <font> 
starts.  So, yes, Tika limits nested elements to 100, and this document 
exceeds that limit.

Tika's {{DefaultHtmlMapper}} only passes through a few handfuls of elements 
(those in {{SAFE_ELEMENTS}}), which do not include <font> or <div>. 

Solr's {{MostlyPassthroughHtmlMapper}} passes through, well, mostly everything, 
so those nested <div> and <font> starts count toward the limit.
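
To make the arithmetic concrete, here is a rough, self-contained sketch in plain Java. It does not use Tika or Solr classes and is not how either project implements the check. The class name, the {{SAFE}} allow-list, the 55 <font> starts (standing in for "~50+"), and the use of a strict "greater than" comparison are all assumptions for illustration. Only the limit of 100 and the ~50 <div> starts come from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Illustration only: a toy model of "mapper decides what passes, depth gets counted".
final class NestingDepthSketch {
    // Limit quoted in the comment; the exact comparison used by Tika is an assumption here.
    static final int MAX_DEPTH = 100;

    // Hypothetical allow-list standing in for a restrictive mapper (no div, no font).
    static final Set<String> SAFE =
            new HashSet<>(Arrays.asList("p", "a", "table", "tr", "td"));

    // Builds a run of unclosed start tags: `divs` "div" names, then `fonts` "font" names.
    static List<String> openTags(int divs, int fonts) {
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < divs; i++) tags.add("div");
        for (int i = 0; i < fonts; i++) tags.add("font");
        return tags;
    }

    // Nesting depth reached when only the passed-through start tags are counted.
    static int depth(List<String> opens, Predicate<String> passesThrough) {
        int depth = 0;
        for (String name : opens) {
            if (passesThrough.test(name)) depth++;
        }
        return depth;
    }

    static boolean exceedsLimit(int depth) {
        return depth > MAX_DEPTH;
    }

    public static void main(String[] args) {
        List<String> doc = openTags(50, 55); // 55 is an assumed value for "~50+"
        int restrictive = depth(doc, SAFE::contains); // drops div and font
        int permissive = depth(doc, name -> true);    // passes everything
        System.out.println("restrictive: " + restrictive + " exceeds=" + exceedsLimit(restrictive));
        System.out.println("permissive:  " + permissive + " exceeds=" + exceedsLimit(permissive));
    }
}
```

With the restrictive predicate none of the <div> or <font> starts are counted, so the depth stays at 0. With the pass-everything predicate all 105 are counted and the limit is exceeded, which matches the behavior described above.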

was (Author: talli...@mitre.org):
This particular exception is caused by Solr's {{MostlyPassthroughHtmlMapper}}.  
I suspect that there are some lopsided tags in the triggering file that are 
somehow filtered out by Tika's default HtmlMapper.

If you want us to change behavior in Tika, please open a separate ticket.

If you want a change in Solr, please open a ticket on their Jira.  

> regression: Zip bomb detected! for HTML file
> --------------------------------------------
>                 Key: TIKA-2091
>                 URL: https://issues.apache.org/jira/browse/TIKA-2091
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>         Environment: Debian jessie Linux, Oracle Java 8
>            Reporter: Rodrigo Rosenfeld Rosas
> Hi, while discussing an issue on Solr's mailing list it was suggested to me 
> to open a ticket here. Please let me know if this is not the proper place for 
> such a ticket.
> After upgrading to the latest Solr, this document is no longer indexing 
> properly in Solr. They told me they upgraded Tika from 1.7 to 1.13 in Solr 
> 6.2. Before the upgrade this document was indexed as expected:
> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
> I hope a fix can make it in time for 1.14 ;)
> Cheers.

This message was sent by Atlassian JIRA
