[jira] [Commented] (TIKA-2091) regression: Zip bomb detected! for HTML file
[ https://issues.apache.org/jira/browse/TIKA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428933#comment-16428933 ]

Harinder commented on TIKA-2091:

Hello [~talli...@mitre.org], you mentioned above that the zip bomb issue when extracting HTML files does not occur if you don't use Solr's custom MostlyPassthroughHtmlMapper. How would I go about configuring Solr to use Tika's default extractor? I have a thread open at SO with full details, [see here|https://stackoverflow.com/questions/49699256/zip-bomb-exception-while-sending-html-document-to-solr].

Thanks!

> regression: Zip bomb detected! for HTML file
>
> Key: TIKA-2091
> URL: https://issues.apache.org/jira/browse/TIKA-2091
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Environment: Debian jessie Linux, Oracle Java 8
> Reporter: Rodrigo Rosenfeld Rosas
> Priority: Major
>
> Hi, while discussing an issue on Solr's mailing list it was suggested to me
> to open a ticket here. Please let me know if this is not the proper place for
> such a ticket.
> After upgrading to the latest Solr, this document is no longer indexing properly
> in Solr. They told me they upgraded Tika from 1.7 to 1.13 in Solr 6.2. Before
> the upgrade this document was indexed as expected:
> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
> I hope a fix can make it in time for 1.14 ;)
> Cheers.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: rfc822 updates and 1.18
Awesomeness

From: "Allison, Timothy B."
Reply-To: "dev@tika.apache.org"
Date: Friday, April 6, 2018 at 11:30 AM
To: "dev@tika.apache.org"
Subject: rfc822 updates and 1.18

All,

I made two updates to our handling of rfc822 files and reran the eval against what Tika 1.18-SNAPSHOT thinks are rfc822 files.

The reports are available here:
http://162.242.228.174/reports/tika_1_18-SNAPSHOT_rfc822_concat_reports.tbz

I _think_ we're good to go... I'll roll the RC1 on Monday unless there are objections.

Best,
Tim
[jira] [Commented] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
[ https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428820#comment-16428820 ]

Hudson commented on TIKA-2625:

SUCCESS: Integrated in Jenkins build tika-branch-1x #16 (See [https://builds.apache.org/job/tika-branch-1x/16/])
TIKA-2625 (tallison: [https://github.com/apache/tika/commit/b928453caf6bb557748168418e49cb8a112d996f])
* (edit) tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
* (add) tika-app/src/test/java/org/apache/tika/extractor/TestEmbeddedDocumentUtil.java

> EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is
> a subtle bug that prevents extraction of text when the AutoDetectParser is
> wrapped in a DigestingParser in a RecursiveParserWrapper
[jira] [Commented] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
[ https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428821#comment-16428821 ]

Hudson commented on TIKA-2626:

SUCCESS: Integrated in Jenkins build tika-branch-1x #16 (See [https://builds.apache.org/job/tika-branch-1x/16/])
TIKA-2626 (tallison: [https://github.com/apache/tika/commit/d1a7cab657539d3ff21fd6d64a89c9fe588c9cfd])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java

> RFC822Parser crazily slower because of creation of new Detector on each file
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to
> determine html vs text than to create a new Detector for every file.
[jira] [Commented] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
[ https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428806#comment-16428806 ]

Hudson commented on TIKA-2626:

SUCCESS: Integrated in Jenkins build Tika-trunk #1465 (See [https://builds.apache.org/job/Tika-trunk/1465/])
TIKA-2626 (tallison: [https://github.com/apache/tika/commit/c8b9b4409c72ded92d588660274977d5a6fdb539])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java

> RFC822Parser crazily slower because of creation of new Detector on each file
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to
> determine html vs text than to create a new Detector for every file.
[jira] [Resolved] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
[ https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2626.

Resolution: Fixed

> RFC822Parser crazily slower because of creation of new Detector on each file
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to
> determine html vs text than to create a new Detector for every file.
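For readers following TIKA-2626, the idea in the description (use regexes to decide html vs text instead of building a new Detector per file) can be sketched roughly as below. This is only an illustration under stated assumptions, not the committed change: the class name `HtmlOrTextSniffer`, the tag list, and the 4096-character limit are all made up for the example. The point it shows is that the `Pattern` is compiled once and shared, so the per-file cost is a single regex scan over a short prefix.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; this is NOT Tika's actual code. */
final class HtmlOrTextSniffer {

    // Compiled once and reused for every message. Pattern instances are
    // immutable and safe to share across threads.
    // The tag list is an assumption chosen for the sketch.
    private static final Pattern HTML_HINT = Pattern.compile(
            "<\\s*(?:!doctype\\s+html|html|head|body|div|p|br|table|span|a)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Only look at the start of the body; the limit is an arbitrary assumption.
    private static final int MAX_CHARS_TO_CHECK = 4096;

    private HtmlOrTextSniffer() {
    }

    /** Returns true if the start of the body contains something that looks like an HTML tag. */
    static boolean looksLikeHtml(String body) {
        if (body == null || body.isEmpty()) {
            return false;
        }
        String prefix = body.length() > MAX_CHARS_TO_CHECK
                ? body.substring(0, MAX_CHARS_TO_CHECK)
                : body;
        return HTML_HINT.matcher(prefix).find();
    }
}
```

A caller would branch on `HtmlOrTextSniffer.looksLikeHtml(bodyText)` to pick an HTML or plain-text handler. This is a heuristic and can misfire on unusual input, which is the trade-off against full detection.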
[jira] [Updated] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
[ https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2625:

Issue Type: Bug (was: Task)

> EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is
> a subtle bug that prevents extraction of text when the AutoDetectParser is
> wrapped in a DigestingParser in a RecursiveParserWrapper
[jira] [Resolved] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
[ https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2625.

Resolution: Fixed

> EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is
> a subtle bug that prevents extraction of text when the AutoDetectParser is
> wrapped in a DigestingParser in a RecursiveParserWrapper
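The TIKA-2625 description does not spell out the failing code path, so the following is only a guess at the general shape of a "doubly decorated" problem, not the actual fix. All types here (`SketchParser`, `LeafParser`, `DecoratingParser`) are hypothetical stand-ins and deliberately do not use Tika's classes. The sketch contrasts peeling a single decorator layer, which stops too early when one wrapper sits inside another, with peeling every layer until the leaf parser is reached.

```java
/** Hypothetical stand-ins for illustration; these are NOT Tika's classes. */
final class DecoratorUnwrapSketch {

    private DecoratorUnwrapSketch() {
    }

    /** Minimal parser abstraction, assumed for this sketch only. */
    interface SketchParser {
        String name();
    }

    /** A leaf parser that would do the real work. */
    static final class LeafParser implements SketchParser {
        private final String name;

        LeafParser(String name) {
            this.name = name;
        }

        @Override
        public String name() {
            return name;
        }
    }

    /** A decorator wrapping another parser, e.g. to add digesting or recursion. */
    static final class DecoratingParser implements SketchParser {
        private final String name;
        private final SketchParser wrapped;

        DecoratingParser(String name, SketchParser wrapped) {
            this.name = name;
            this.wrapped = wrapped;
        }

        @Override
        public String name() {
            return name;
        }

        SketchParser getWrapped() {
            return wrapped;
        }
    }

    /** Peels exactly one layer; with nested decorators this returns another decorator. */
    static SketchParser unwrapOnce(SketchParser parser) {
        return (parser instanceof DecoratingParser)
                ? ((DecoratingParser) parser).getWrapped()
                : parser;
    }

    /** Peels every layer until a non-decorator (the leaf) is reached. */
    static SketchParser unwrapFully(SketchParser parser) {
        SketchParser current = parser;
        while (current instanceof DecoratingParser) {
            current = ((DecoratingParser) current).getWrapped();
        }
        return current;
    }
}
```

With a leaf wrapped in a "digesting" decorator that is itself wrapped in a "recursive" decorator, `unwrapOnce` hands back the digesting decorator, while `unwrapFully` hands back the leaf. Code that expects a leaf after a single unwrap would then look up the wrong object.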
[jira] [Updated] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
[ https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2626:

Issue Type: Bug (was: Task)

> RFC822Parser crazily slower because of creation of new Detector on each file
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to
> determine html vs text than to create a new Detector for every file.
[jira] [Updated] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
[ https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2626:

Priority: Blocker (was: Major)

> RFC822Parser crazily slower because of creation of new Detector on each file
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to
> determine html vs text than to create a new Detector for every file.
[jira] [Commented] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
[ https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428770#comment-16428770 ]

Hudson commented on TIKA-2626:

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #229 (See [https://builds.apache.org/job/tika-2.x-windows/229/])
TIKA-2626 (tallison: rev c8b9b4409c72ded92d588660274977d5a6fdb539)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java

> RFC822Parser crazily slower because of creation of new Detector on each file
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to
> determine html vs text than to create a new Detector for every file.
rfc822 updates and 1.18
All,

I made two updates to our handling of rfc822 files and reran the eval against what Tika 1.18-SNAPSHOT thinks are rfc822 files.

The reports are available here:
http://162.242.228.174/reports/tika_1_18-SNAPSHOT_rfc822_concat_reports.tbz

I _think_ we're good to go... I'll roll the RC1 on Monday unless there are objections.

Best,
Tim
[jira] [Commented] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
[ https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428568#comment-16428568 ]

Hudson commented on TIKA-2625:

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #228 (See [https://builds.apache.org/job/tika-2.x-windows/228/])
TIKA-2625 (tallison: rev d502a4b31348a0176a3999f07cf970bfa6a9dac1)
* (edit) tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (add) tika-app/src/test/java/org/apache/tika/extractor/TestEmbeddedDocumentUtil.java
* (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java

> EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is
> a subtle bug that prevents extraction of text when the AutoDetectParser is
> wrapped in a DigestingParser in a RecursiveParserWrapper
[jira] [Commented] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
[ https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428506#comment-16428506 ]

Hudson commented on TIKA-2625:

SUCCESS: Integrated in Jenkins build Tika-trunk #1464 (See [https://builds.apache.org/job/Tika-trunk/1464/])
TIKA-2625 (tallison: [https://github.com/apache/tika/commit/d502a4b31348a0176a3999f07cf970bfa6a9dac1])
* (edit) tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (add) tika-app/src/test/java/org/apache/tika/extractor/TestEmbeddedDocumentUtil.java
* (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java

> EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is
> a subtle bug that prevents extraction of text when the AutoDetectParser is
> wrapped in a DigestingParser in a RecursiveParserWrapper
[jira] [Created] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file
Tim Allison created TIKA-2626:

Summary: RFC822Parser crazily slower because of creation of new Detector on each file
Key: TIKA-2626
URL: https://issues.apache.org/jira/browse/TIKA-2626
Project: Tika
Issue Type: Task
Reporter: Tim Allison

RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to determine html vs text than to create a new Detector for every file.
[jira] [Created] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
Tim Allison created TIKA-2625:

Summary: EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
Key: TIKA-2625
URL: https://issues.apache.org/jira/browse/TIKA-2625
Project: Tika
Issue Type: Task
Reporter: Tim Allison

In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is a subtle bug that prevents extraction of text when the AutoDetectParser is wrapped in a DigestingParser in a RecursiveParserWrapper