date:20220303

[GitHub] [tika] dependabot[bot] opened a new pull request #522: Bump twelvemonkeys.version from 3.8.1 to 3.8.2

2022-03-03 Thread GitBox



dependabot[bot] opened a new pull request #522:
URL: https://github.com/apache/tika/pull/522


   Bumps `twelvemonkeys.version` from 3.8.1 to 3.8.2.
   Updates `common-io` from 3.8.1 to 3.8.2
   
   Updates `imageio-bmp` from 3.8.1 to 3.8.2
   
   Updates `imageio-jpeg` from 3.8.1 to 3.8.2
   
   Updates `imageio-psd` from 3.8.1 to 3.8.2
   
   Updates `imageio-tiff` from 3.8.1 to 3.8.2
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [tika] dependabot[bot] opened a new pull request #521: Bump maven-javadoc-plugin from 3.3.1 to 3.3.2

2022-03-03 Thread GitBox



dependabot[bot] opened a new pull request #521:
URL: https://github.com/apache/tika/pull/521


   Bumps [maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) 
from 3.3.1 to 3.3.2.
   
   Commits
   
   - 50a41f7 [maven-release-plugin] prepare release maven-javadoc-plugin-3.3.2
     https://github.com/apache/maven-javadoc-plugin/commit/50a41f78278c1d5957cb03e89223fcbc640f1b68
   - 5af4519 [MJAVADOC-705] Upgrade Maven Reporting API to 3.1.0
     https://github.com/apache/maven-javadoc-plugin/commit/5af4519ce17308f35d4cfb553d0af8168cac35f4
   - ee4132f [MJAVADOC-704] Javadoc plugin does not respect jdkToolchain
     https://github.com/apache/maven-javadoc-plugin/commit/ee4132f9d0f6f412784e108305524cf3b3a3009a
   - 651b98e Bump doxia-site-renderer from 1.10 to 1.11.1
     https://github.com/apache/maven-javadoc-plugin/commit/651b98e6951ee2e3d8fefa1bcb3629f1dae763be
   - db20fdd Bump plexus-archiver from 4.2.5 to 4.2.6
     https://github.com/apache/maven-javadoc-plugin/commit/db20fddd4eb443711948645613c03a4ccc516dab
   - b51c5d8 [MJAVADOC-694] Avoid empty warn message from getResolvePathResult
     https://github.com/apache/maven-javadoc-plugin/commit/b51c5d801c1e329b1e59e7b66142c8fd9c0ba6af
   - 6b1515e Bump httpcore from 4.4.14 to 4.4.15
     https://github.com/apache/maven-javadoc-plugin/commit/6b1515ed43826417cf4cbad88be1107e92cba3f5
   - a4aa7dc (doc) fix javadoc issues
     https://github.com/apache/maven-javadoc-plugin/commit/a4aa7dcdc22eda19a08cc3c3e231631d422f1725
   - 3309cc2 Bump maven-project-info-reports-plugin to 3.1.2
     https://github.com/apache/maven-javadoc-plugin/commit/3309cc2ea03237576f3d702cffe1d628c3ea0d12
   - 51a2c3f Bump maven-javadoc-plugin to 3.3.1
     https://github.com/apache/maven-javadoc-plugin/commit/51a2c3f803a0124c2591e46925210da705042f19
   Additional commits viewable in the compare view:
   https://github.com/apache/maven-javadoc-plugin/compare/maven-javadoc-plugin-3.3.1...maven-javadoc-plugin-3.3.2
   
   
   
   
   

[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501040#comment-17501040
 ] 

Hudson commented on TIKA-3687:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #475 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/475/])
[TIKA-3687] Fix email detection (#520) (github: 
[https://github.com/apache/tika/commit/5444f80d1b71845ff47c91376f5c90a40dae5a4f])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/resources/test-documents/testRFC822-ARC
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java


> Email file detected as text/html
> 
>
> Key: TIKA-3687
> URL: https://issues.apache.org/jira/browse/TIKA-3687
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Thierry Guérin
>Priority: Minor
> Fix For: 2.3.1
>
> Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from 
> Office365) is detected as HTML.
> This is because it contains ARC * headers, but they're not the first one, so 
> the matcher that looks for ARC headers fails, and the matcher for the regular 
> 'From' header also fails because the 'From' header occurs after 1024 
> characters.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500990#comment-17500990
 ] 

Tim Allison edited comment on TIKA-3687 at 3/3/22, 7:13 PM:


Thank you [~tguerin]!


was (Author: talli...@mitre.org):
Thank you!


[jira] [Resolved] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Tim Allison (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3687.
---
Fix Version/s: 2.3.1
   Resolution: Fixed

Thank you!


[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500986#comment-17500986
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

tballison merged pull request #520:
URL: https://github.com/apache/tika/pull/520


   



[GitHub] [tika] tballison merged pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox



tballison merged pull request #520:
URL: https://github.com/apache/tika/pull/520


   



[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500974#comment-17500974
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   Good point, I like your solution better. Code changed accordingly.





[GitHub] [tika] SchwingSK commented on a change in pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox



SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   Good point, I like your solution better. Code changed accordingly.





[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962
 ] 

Thierry Guérin edited comment on TIKA-3687 at 3/3/22, 6:22 PM:
---

Created a pull request: https://github.com/apache/tika/pull/520

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. 
The other option was to increase the 1024 limit to at least 8000 (I have 
another email in which the first 'From:' appears at around 6400) in lines 
6407-6420. I'm sure someone here has a good idea of which option is more 
efficient.

As of now, I have only found examples with a single 'Received:' header before 
the 'ARC*' headers, which is why I think 1024 may be overkill.
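
To make the offset discussion concrete, here is a minimal sketch of how a match at a fixed offset differs from a match over an offset range. It is not Tika's detector code: the class `OffsetRangeMatch`, the method `matchesInRange`, and the sample message are all hypothetical and only illustrate the idea.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a simplified stand-in for an offset-range magic match,
// not Tika's actual detector code.
class OffsetRangeMatch {

    /** True if pattern starts at some offset in [start, end] (inclusive) of data. */
    static boolean matchesInRange(byte[] data, byte[] pattern, int start, int end) {
        int last = Math.min(end, data.length - pattern.length);
        for (int i = Math.max(start, 0); i <= last; i++) {
            if (startsAt(data, pattern, i)) {
                return true;
            }
        }
        return false;
    }

    private static boolean startsAt(byte[] data, byte[] pattern, int offset) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical message: one Received header precedes the ARC header.
        byte[] mail = ("Received: from mail.example.org\r\n"
                + "ARC-Seal: i=1; a=rsa-sha256\r\n"
                + "From: someone@example.org\r\n").getBytes(StandardCharsets.US_ASCII);
        byte[] arc = "ARC-".getBytes(StandardCharsets.US_ASCII);

        // Offset 0 only: fails, because the message starts with "Received:".
        System.out.println("offset 0 only:  " + matchesInRange(mail, arc, 0, 0));
        // Offsets 0..1024: succeeds, because "ARC-" starts on the second line.
        System.out.println("offset 0..1024: " + matchesInRange(mail, arc, 0, 1024));
    }
}
```

A wider range finds headers that are not first in the file, at the cost of scanning more bytes per candidate type, which is the trade-off raised above.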


was (Author: tguerin):
Created a pull request: [https://github.com/apache/tika/pull/520.]

I went with changing the X|DKIM|ARC headers look-ahead, which was 0 to 1024. 
Other solution was to increase 1024 to at least 8000 (I have another email in 
which the first 'From:' is around 6400) in lines 6407-6420. I'm sure someone 
here has a good idea on which version is the most efficient.

 


[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500966#comment-17500966
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   I worry about looking for X- anywhere in the first 1024 without 
requiring a \n before it.
   
   What would you think of adding something like this into the previous 
minShouldMatch=2 clause?
   `
   
   
   
   `
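
The concern about an unanchored `X-` can be illustrated with a small sketch. This is not the XML proposed in the review (that snippet did not survive in this archive) and not Tika's matcher; `HeaderAnchorDemo`, `containsWithin`, and both sample inputs are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: contrasts a bare "X-" search with one anchored on a
// preceding newline. Not Tika's matcher; names here are hypothetical.
class HeaderAnchorDemo {

    /** True if pattern first occurs at an offset in [0, limit] of data. */
    static boolean containsWithin(byte[] data, String pattern, int limit) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int idx = text.indexOf(pattern);
        return idx >= 0 && idx <= limit;
    }

    public static void main(String[] args) {
        // An HTML page that mentions "X-" inside a meta tag, not as a header.
        byte[] html = ("<html><head>\n"
                + "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n"
                + "</head></html>").getBytes(StandardCharsets.ISO_8859_1);
        // A mail whose X- header starts a line but is not the first header.
        byte[] mail = ("Received: from mail.example.org\r\n"
                + "X-Mailer: demo\r\n"
                + "From: someone@example.org\r\n").getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("html, bare X-:     " + containsWithin(html, "X-", 1024));
        System.out.println("html, newline X-:  " + containsWithin(html, "\nX-", 1024));
        System.out.println("mail, newline X-:  " + containsWithin(mail, "\nX-", 1024));
    }
}
```

In this sketch the bare pattern also hits the HTML page, while the newline-anchored pattern only hits the mail, which is why anchoring on `\n` reduces false positives when the search window is widened.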
   





[GitHub] [tika] tballison commented on a change in pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox



tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   I worry about looking for X- anywhere in the first 1024 without 
requiring a \n before it.
   
   What would you think of adding something like this into the previous 
minShouldMatch=2 clause?
   `
   
   
   
   `
   





[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962
 ] 

Thierry Guérin commented on TIKA-3687:
--

Created a pull request: https://github.com/apache/tika/pull/520

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. 
The other option was to increase the 1024 limit to at least 8000 (I have 
another email in which the first 'From:' appears at around 6400) in lines 
6407-6420. I'm sure someone here has a good idea of which option is more 
efficient.

 


[jira] [Updated] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thierry Guérin updated TIKA-3687:
-
Description: 
The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC * headers, but they're not the first one, so 
the matcher that looks for ARC headers fails, and the matcher for the regular 
'From' header also fails because the 'From' header occurs after 1024 
characters.

  was:
The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC -* headers, but they're not the first one, so 
the matcher that looks for ARC- headers fails, and the matcher for the regular 
'From' header also fails because the 'From' header occurs after 1024 
characters.


[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500953#comment-17500953
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

SchwingSK opened a new pull request #520:
URL: https://github.com/apache/tika/pull/520


   1024 is maybe a bit overkill for the X|DKIM|ARC headers look-ahead?



[GitHub] [tika] SchwingSK opened a new pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox



SchwingSK opened a new pull request #520:
URL: https://github.com/apache/tika/pull/520


   1024 is maybe a bit overkill for the X|DKIM|ARC headers look-ahead?



[jira] [Created] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira

Thierry Guérin created TIKA-3687:


 Summary: Email file detected as text/html
 Key: TIKA-3687
 URL: https://issues.apache.org/jira/browse/TIKA-3687
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Thierry Guérin
 Attachments: testRFC822-ARC.eml

The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC -* headers, but they're not the first one, so 
the matcher that looks for ARC- headers fails, and the matcher for the regular 
'From' header also fails because the 'From' header occurs after 1024 
characters.




[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500904#comment-17500904
 ] 

Tim Allison commented on TIKA-3668:
---

When I ran the same test against 1.26 with Tesseract removed via the Tika 
config, it was a bit faster and used a bit less CPU.

{noformat}
~$ pidstat -p 261234 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

12:01:10 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
12:01:10 PM  1000    261234    0.14    0.00    0.00    0.00    0.14     5  java

12:01:10 PM   UID       PID    usr-ms system-ms  guest-ms  Command
12:01:10 PM  1000    261234    396490     12570         0  java
{noformat}
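
As a rough illustration of how small the difference is, the usr-ms figure above can be set against the usr-ms figure from the run with the OCR parser disabled via tika-config that is quoted elsewhere in this thread (442080). Treating the two runs as directly comparable is an assumption, and the class and method names below are hypothetical.

```java
// Illustrative arithmetic only; the figures are the usr-ms values quoted in
// this thread, and treating the two runs as directly comparable is an assumption.
class CpuTimeComparison {

    /** Relative increase of other over base, as a percentage. */
    static double percentIncrease(long base, long other) {
        return 100.0 * (other - base) / base;
    }

    public static void main(String[] args) {
        long usrMsTika126 = 396_490L; // 1.26, Tesseract removed via config
        long usrMsNewer = 442_080L;   // newer run, OCR parser disabled via config
        System.out.printf("user CPU time increase: %.1f%%%n",
                percentIncrease(usrMsTika126, usrMsNewer));
    }
}
```

Under that assumption the increase in user CPU time is on the order of 11 to 12 percent, far from the 6 to 8 times reported in the issue.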

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
>  Issue Type: Bug
>Reporter: Manjunath Dhongadi
>Priority: Major
> Attachments: tika-config-no-tess.xml, tika-config.xml
>
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has gone up drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available for this?




[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)

[
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874
]

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:46 PM:

Thank you. I tried three things this morning.

1) Manually reviewed and re-tested image rendering and extract inline images
code in the PDFParser. With debugging and custom logging, I could see that
even running multi-threaded, the code works as expected. If the header says
no-ocr, pages aren't rendered in the PDFParser and inline images are not
extracted.

2) In a single thread, I ran all the files in our unit tests with custom
logging to detect if the TesseractOCRParser was being called on any of the file
types when the header was set to no_ocr. I couldn't find any problems. The
TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings against all of our test files 10 times.
The client was single threaded. I ran pidstat against the forked process, not
the primary watcher process. The results all basically look the same to me.

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}

was (Author: talli...@mitre.org):
Thank you. I tried three things this morning.

3) I ran pidstat with three settings; the client was single threaded. I ran
pidstat against the forked process, not the primary watcher process. The
results all basically look the same to me.