[jira] [Commented] (TIKA-2091) regression: Zip bomb detected! for HTML file

2018-04-06 Thread Harinder (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428933#comment-16428933
 ] 

Harinder commented on TIKA-2091:


Hello [~talli...@mitre.org], you mentioned above that the zip bomb issue when 
extracting HTML files does not occur if you don't use Solr's custom 
MostlyPassthroughHtmlMapper.  
How would I go about configuring Solr to use Tika's default extractor? 

I have a thread open at SO with full details, [see 
here|https://stackoverflow.com/questions/49699256/zip-bomb-exception-while-sending-html-document-to-solr].

Thanks!

> regression: Zip bomb detected! for HTML file
> 
>
> Key: TIKA-2091
> URL: https://issues.apache.org/jira/browse/TIKA-2091
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Environment: Debian jessie Linux, Oracle Java 8
> Reporter: Rodrigo Rosenfeld Rosas
> Priority: Major
>
> Hi, while discussing an issue on Solr's mailing list it was suggested to me 
> to open a ticket here. Please let me know if this is not the proper place for 
> such a ticket.
> After upgrading to the latest Solr, this document is no longer indexing properly 
> in Solr. They told me they upgraded Tika from 1.7 to 1.13 in Solr 6.2. Before 
> the upgrade this document was indexed as expected:
> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
> I hope a fix could go in time for 1.14 ;)
> Cheers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: rfc822 updates and 1.18

2018-04-06 Thread Chris Mattmann
Awesomeness

From: "Allison, Timothy B."
Reply-To: "dev@tika.apache.org"
Date: Friday, April 6, 2018 at 11:30 AM
To: "dev@tika.apache.org"
Subject: rfc822 updates and 1.18

All,

I made two updates to our handling of rfc822 files and reran the eval against 
what Tika 1.18-SNAPSHOT thinks are rfc822 files.  The reports are available 
here:

http://162.242.228.174/reports/tika_1_18-SNAPSHOT_rfc822_concat_reports.tbz

I _think_ we're good to go...  I'll roll the RC1 on Monday unless there are 
objections.

 Best,

  Tim


[jira] [Commented] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser

2018-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428820#comment-16428820
 ] 

Hudson commented on TIKA-2625:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #16 (See 
[https://builds.apache.org/job/tika-branch-1x/16/])
TIKA-2625 (tallison: 
[https://github.com/apache/tika/commit/b928453caf6bb557748168418e49cb8a112d996f])
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
* (add) 
tika-app/src/test/java/org/apache/tika/extractor/TestEmbeddedDocumentUtil.java


> EmbeddedDocUtil not correctly handling doubly decorated parsers in 
> tryToFindExistingLeafParser
> --
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is 
> a subtle bug that prevents extraction of text when the AutoDetectParser is 
> wrapped in a DigestingParser in a RecursiveParserWrapper





[jira] [Commented] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428821#comment-16428821
 ] 

Hudson commented on TIKA-2626:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #16 (See 
[https://builds.apache.org/job/tika-branch-1x/16/])
TIKA-2626 (tallison: 
[https://github.com/apache/tika/commit/d1a7cab657539d3ff21fd6d64a89c9fe588c9cfd])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java


> RFC822Parser crazily slower because of creation of new Detector on each file
> 
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
> determine html vs text than to create a new Detector for every file.





[jira] [Commented] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428806#comment-16428806
 ] 

Hudson commented on TIKA-2626:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1465 (See 
[https://builds.apache.org/job/Tika-trunk/1465/])
TIKA-2626 (tallison: 
[https://github.com/apache/tika/commit/c8b9b4409c72ded92d588660274977d5a6fdb539])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java


> RFC822Parser crazily slower because of creation of new Detector on each file
> 
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
> determine html vs text than to create a new Detector for every file.





[jira] [Resolved] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2626.
---
Resolution: Fixed

> RFC822Parser crazily slower because of creation of new Detector on each file
> 
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
> determine html vs text than to create a new Detector for every file.





[jira] [Updated] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser

2018-04-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2625:
--
Issue Type: Bug  (was: Task)

> EmbeddedDocUtil not correctly handling doubly decorated parsers in 
> tryToFindExistingLeafParser
> --
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is 
> a subtle bug that prevents extraction of text when the AutoDetectParser is 
> wrapped in a DigestingParser in a RecursiveParserWrapper





[jira] [Resolved] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser

2018-04-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2625.
---
Resolution: Fixed

> EmbeddedDocUtil not correctly handling doubly decorated parsers in 
> tryToFindExistingLeafParser
> --
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is 
> a subtle bug that prevents extraction of text when the AutoDetectParser is 
> wrapped in a DigestingParser in a RecursiveParserWrapper





[jira] [Updated] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2626:
--
Issue Type: Bug  (was: Task)

> RFC822Parser crazily slower because of creation of new Detector on each file
> 
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
> determine html vs text than to create a new Detector for every file.





[jira] [Updated] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2626:
--
Priority: Blocker  (was: Major)

> RFC822Parser crazily slower because of creation of new Detector on each file
> 
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Blocker
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
> determine html vs text than to create a new Detector for every file.





[jira] [Commented] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428770#comment-16428770
 ] 

Hudson commented on TIKA-2626:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #229 (See 
[https://builds.apache.org/job/tika-2.x-windows/229/])
TIKA-2626 (tallison: rev c8b9b4409c72ded92d588660274977d5a6fdb539)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java


> RFC822Parser crazily slower because of creation of new Detector on each file
> 
>
> Key: TIKA-2626
> URL: https://issues.apache.org/jira/browse/TIKA-2626
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
> determine html vs text than to create a new Detector for every file.





rfc822 updates and 1.18

2018-04-06 Thread Allison, Timothy B.
All,
I made two updates to our handling of rfc822 files and reran the eval against 
what Tika 1.18-SNAPSHOT thinks are rfc822 files.  The reports are available 
here:

http://162.242.228.174/reports/tika_1_18-SNAPSHOT_rfc822_concat_reports.tbz

I _think_ we're good to go...  I'll roll the RC1 on Monday unless there are 
objections.

 Best,

  Tim



[jira] [Commented] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser

2018-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428568#comment-16428568
 ] 

Hudson commented on TIKA-2625:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #228 (See 
[https://builds.apache.org/job/tika-2.x-windows/228/])
TIKA-2625 (tallison: rev d502a4b31348a0176a3999f07cf970bfa6a9dac1)
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (add) 
tika-app/src/test/java/org/apache/tika/extractor/TestEmbeddedDocumentUtil.java
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java


> EmbeddedDocUtil not correctly handling doubly decorated parsers in 
> tryToFindExistingLeafParser
> --
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is 
> a subtle bug that prevents extraction of text when the AutoDetectParser is 
> wrapped in a DigestingParser in a RecursiveParserWrapper





[jira] [Commented] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser

2018-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428506#comment-16428506
 ] 

Hudson commented on TIKA-2625:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1464 (See 
[https://builds.apache.org/job/Tika-trunk/1464/])
TIKA-2625 (tallison: 
[https://github.com/apache/tika/commit/d502a4b31348a0176a3999f07cf970bfa6a9dac1])
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (add) 
tika-app/src/test/java/org/apache/tika/extractor/TestEmbeddedDocumentUtil.java
* (edit) 
tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java


> EmbeddedDocUtil not correctly handling doubly decorated parsers in 
> tryToFindExistingLeafParser
> --
>
> Key: TIKA-2625
> URL: https://issues.apache.org/jira/browse/TIKA-2625
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is 
> a subtle bug that prevents extraction of text when the AutoDetectParser is 
> wrapped in a DigestingParser in a RecursiveParserWrapper





[jira] [Created] (TIKA-2626) RFC822Parser crazily slower because of creation of new Detector on each file

2018-04-06 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2626:
-

Summary: RFC822Parser crazily slower because of creation of new Detector on each file
Key: TIKA-2626
URL: https://issues.apache.org/jira/browse/TIKA-2626
Project: Tika
Issue Type: Task
Reporter: Tim Allison


RFC822 parser is crazily slower in 1.18-SNAPSHOT. Better to use regexes to 
determine html vs text than to create a new Detector for every file.
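
For readers following along, here is a minimal sketch of the general idea described above: decide "html vs text" with a regex that is compiled once and reused, instead of building a heavier object for every file. This is not the code from the TIKA-2626 commit. The class name HtmlSniffer, the method looksLikeHtml, the tag list, and the 4096-character sniff window are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not Tika's RFC822Parser or MailContentHandler.
final class HtmlSniffer {

    // Compiled once and shared. Pattern instances are immutable and thread-safe,
    // so nothing has to be rebuilt per file.
    // The lookahead requires whitespace, '/' or '>' right after the tag name,
    // so an address such as <paul@example.com> is not mistaken for a <p> tag.
    private static final Pattern HTML_HINT = Pattern.compile(
            "<\\s*(?:!doctype\\s+html|html|head|body|div|p|br|table|a)(?=[\\s/>])[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Assumed limit: only the start of the body is inspected.
    private static final int MAX_SNIFF_CHARS = 4096;

    public static boolean looksLikeHtml(String body) {
        if (body == null) {
            return false;
        }
        String prefix = body.length() > MAX_SNIFF_CHARS
                ? body.substring(0, MAX_SNIFF_CHARS)
                : body;
        return HTML_HINT.matcher(prefix).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("<html><body>Hi</body></html>"));   // true
        System.out.println(looksLikeHtml("Contact <paul@example.com> now")); // false
    }
}
```

A heuristic like this trades accuracy for speed: it can miss HTML that uses only unlisted tags, so the tag list would need tuning against real mail.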





[jira] [Created] (TIKA-2625) EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser

2018-04-06 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2625:
-

Summary: EmbeddedDocUtil not correctly handling doubly decorated parsers in tryToFindExistingLeafParser
Key: TIKA-2625
URL: https://issues.apache.org/jira/browse/TIKA-2625
Project: Tika
Issue Type: Task
Reporter: Tim Allison


In reviewing diffs with rfc822 in Tika 1.17 and Tika 1.18-SNAPSHOT, there is a 
subtle bug that prevents extraction of text when the AutoDetectParser is 
wrapped in a DigestingParser in a RecursiveParserWrapper.
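
To illustrate what "doubly decorated" means here, the sketch below shows why a lookup that unwraps only one decorator layer misses a leaf sitting two layers down, and how a loop finds it. It is not the code from the TIKA-2625 commit. The types Parser, Leaf and Wrapper and the method findLeaf are hypothetical stand-ins, not Tika's Parser, ParserDecorator or EmbeddedDocumentUtil.

```java
import java.util.Optional;

// Hypothetical illustration only; these nested types are NOT Tika classes.
final class LeafFinder {

    interface Parser { }

    // Stand-in for a concrete parser that does the real work.
    static final class Leaf implements Parser { }

    // Stand-in for a decorator that wraps another parser.
    static final class Wrapper implements Parser {
        private final Parser wrapped;

        Wrapper(Parser wrapped) {
            this.wrapped = wrapped;
        }

        Parser getWrapped() {
            return wrapped;
        }
    }

    // Peels off decorators until the target type is found or nothing is left to unwrap.
    static <T extends Parser> Optional<T> findLeaf(Parser parser, Class<T> target) {
        Parser current = parser;
        while (current != null) {
            if (target.isInstance(current)) {
                return Optional.of(target.cast(current));
            }
            if (!(current instanceof Wrapper)) {
                return Optional.empty();
            }
            current = ((Wrapper) current).getWrapped();
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Two decorator layers around the leaf, as in the report above.
        Parser doublyWrapped = new Wrapper(new Wrapper(new Leaf()));
        System.out.println(findLeaf(doublyWrapped, Leaf.class).isPresent()); // true
    }
}
```

A single `instanceof` check followed by one `getWrapped()` call would return the inner Wrapper in the example above, not the Leaf, which is the kind of failure the loop avoids.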


