Rodrigo Rosenfeld Rosas commented on TIKA-2091:

Hmm, I'll try to get more details about it, but it will take some time, as I 
have to upgrade Solr on my development machine first. The indexer also splits 
the document into pages using some criteria and indexes the individual pages 
separately. The error may have happened in one of those pages, but I run tidy 
on them before indexing, so I know they are well-formed. Also, I think I remove 
the invalid Edgar header that precedes the html tag before indexing the full 
document. Anyway, as soon as I have more details I'll let you know. The indexer 
reported the id of the document that failed, but not all the details, such as 
whether it was an individual page...

> regression: Zip bomb detected! for HTML file
> --------------------------------------------
>                 Key: TIKA-2091
>                 URL: https://issues.apache.org/jira/browse/TIKA-2091
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>         Environment: Debian jessie Linux, Oracle Java 8
>            Reporter: Rodrigo Rosenfeld Rosas
>             Fix For: 1.7
> Hi, while discussing an issue on Solr's mailing list it was suggested that I 
> open a ticket here. Please let me know if this is not the proper place for 
> such a ticket.
> After upgrading to the latest Solr, this document is no longer indexed 
> properly. They told me they upgraded Tika from 1.7 to 1.13 in Solr 6.2. Before 
> the upgrade this document was indexed as expected:
> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
> I hope a fix can make it in time for 1.14 ;)
> Cheers.

This message was sent by Atlassian JIRA