[GitHub] [tika] dependabot[bot] opened a new pull request #522: Bump twelvemonkeys.version from 3.8.1 to 3.8.2

2022-03-03 Thread GitBox


dependabot[bot] opened a new pull request #522:
URL: https://github.com/apache/tika/pull/522


   Bumps `twelvemonkeys.version` from 3.8.1 to 3.8.2.
   Updates `common-io` from 3.8.1 to 3.8.2
   
   Updates `imageio-bmp` from 3.8.1 to 3.8.2
   
   Updates `imageio-jpeg` from 3.8.1 to 3.8.2
   
   Updates `imageio-psd` from 3.8.1 to 3.8.2
   
   Updates `imageio-tiff` from 3.8.1 to 3.8.2
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [tika] dependabot[bot] opened a new pull request #521: Bump maven-javadoc-plugin from 3.3.1 to 3.3.2

2022-03-03 Thread GitBox


dependabot[bot] opened a new pull request #521:
URL: https://github.com/apache/tika/pull/521


   Bumps [maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) 
from 3.3.1 to 3.3.2.
   
   Commits
   
   - [50a41f7](https://github.com/apache/maven-javadoc-plugin/commit/50a41f78278c1d5957cb03e89223fcbc640f1b68) [maven-release-plugin] prepare release maven-javadoc-plugin-3.3.2
   - [5af4519](https://github.com/apache/maven-javadoc-plugin/commit/5af4519ce17308f35d4cfb553d0af8168cac35f4) [MJAVADOC-705] Upgrade Maven Reporting API to 3.1.0
   - [ee4132f](https://github.com/apache/maven-javadoc-plugin/commit/ee4132f9d0f6f412784e108305524cf3b3a3009a) [MJAVADOC-704] Javadoc plugin does not respect jdkToolchain
   - [651b98e](https://github.com/apache/maven-javadoc-plugin/commit/651b98e6951ee2e3d8fefa1bcb3629f1dae763be) Bump doxia-site-renderer from 1.10 to 1.11.1
   - [db20fdd](https://github.com/apache/maven-javadoc-plugin/commit/db20fddd4eb443711948645613c03a4ccc516dab) Bump plexus-archiver from 4.2.5 to 4.2.6
   - [b51c5d8](https://github.com/apache/maven-javadoc-plugin/commit/b51c5d801c1e329b1e59e7b66142c8fd9c0ba6af) [MJAVADOC-694] Avoid empty warn message from getResolvePathResult
   - [6b1515e](https://github.com/apache/maven-javadoc-plugin/commit/6b1515ed43826417cf4cbad88be1107e92cba3f5) Bump httpcore from 4.4.14 to 4.4.15
   - [a4aa7dc](https://github.com/apache/maven-javadoc-plugin/commit/a4aa7dcdc22eda19a08cc3c3e231631d422f1725) (doc) fix javadoc issues
   - [3309cc2](https://github.com/apache/maven-javadoc-plugin/commit/3309cc2ea03237576f3d702cffe1d628c3ea0d12) Bump maven-project-info-reports-plugin to 3.1.2
   - [51a2c3f](https://github.com/apache/maven-javadoc-plugin/commit/51a2c3f803a0124c2591e46925210da705042f19) Bump maven-javadoc-plugin to 3.3.1
   - Additional commits viewable in [compare view](https://github.com/apache/maven-javadoc-plugin/compare/maven-javadoc-plugin-3.3.1...maven-javadoc-plugin-3.3.2)
   
   
   
   
   
   
   
   






[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501040#comment-17501040
 ] 

Hudson commented on TIKA-3687:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #475 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/475/])
[TIKA-3687] Fix email detection (#520) (github: 
[https://github.com/apache/tika/commit/5444f80d1b71845ff47c91376f5c90a40dae5a4f])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/resources/test-documents/testRFC822-ARC
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java


> Email file detected as text/html
> 
>
> Key: TIKA-3687
> URL: https://issues.apache.org/jira/browse/TIKA-3687
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Thierry Guérin
>Priority: Minor
> Fix For: 2.3.1
>
> Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from 
> Office365) is detected as HTML.
> This is because it contains ARC * headers, but they're not the first headers, so 
> the matcher that looks for ARC headers fails, and the matcher for the regular 
> 'From' header also fails because the 'From' header occurs after 1024 
> characters.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500990#comment-17500990
 ] 

Tim Allison edited comment on TIKA-3687 at 3/3/22, 7:13 PM:


Thank you [~tguerin]!


was (Author: talli...@mitre.org):
Thank you!



[jira] [Resolved] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3687.
---
Fix Version/s: 2.3.1
   Resolution: Fixed

Thank you!



[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500986#comment-17500986
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

tballison merged pull request #520:
URL: https://github.com/apache/tika/pull/520


   




[GitHub] [tika] tballison merged pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox


tballison merged pull request #520:
URL: https://github.com/apache/tika/pull/520


   






[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500974#comment-17500974
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   Good point, I like your solution better. Code changed accordingly.






[GitHub] [tika] SchwingSK commented on a change in pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox


SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   Good point, I like your solution better. Code changed accordingly.








[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962
 ] 

Thierry Guérin edited comment on TIKA-3687 at 3/3/22, 6:22 PM:
---

Created a pull request: [https://github.com/apache/tika/pull/520].

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. 
The other solution was to increase 1024 to at least 8000 (I have another email in 
which the first 'From:' is around offset 6400) in lines 6407-6420. I'm sure someone 
here has a good idea of which version is the most efficient.

As of now, I have only found examples where there was one 'Received:' header before 
the 'ARC*' headers, which is why I think that 1024 may be overkill.


was (Author: tguerin):
Created a pull request: [https://github.com/apache/tika/pull/520].

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. 
The other solution was to increase 1024 to at least 8000 (I have another email in 
which the first 'From:' is around offset 6400) in lines 6407-6420. I'm sure someone 
here has a good idea of which version is the most efficient.

 



[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500966#comment-17500966
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   I worry about looking for X- anywhere in the first 1024 without 
requiring a \n before it.
   
   What would you think of adding something like this into the previous 
minShouldMatch=2 clause?
   `
   
   
   
   `
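
To illustrate the concern in this review comment, here is a minimal Python sketch. It is not Tika code: `found_within` is a hypothetical helper and both sample inputs are made up. It shows how an unanchored search for `X-` in the first 1024 bytes can hit ordinary HTML, while requiring a preceding `\n` only hits something that looks like a header line.

```python
def found_within(data: bytes, pattern: bytes, limit: int = 1024) -> bool:
    """True if `pattern` begins somewhere in the first `limit` bytes of `data`."""
    return data.find(pattern, 0, limit + len(pattern) - 1) != -1


# Made-up inputs, for illustration only.
html_page = (
    b'<html><head><meta http-equiv="X-UA-Compatible" content="IE=edge">'
    b"</head><body>hello</body></html>"
)
email = (
    b"Received: from mail.example.org\r\n"
    b"X-Mailer: example\r\n"
    b"From: a@example.org\r\n\r\nhi"
)

print(found_within(html_page, b"X-"))    # True: false positive on plain HTML
print(found_within(html_page, b"\nX-"))  # False: no line starts with X-
print(found_within(email, b"\nX-"))      # True: a real header line
```

The newline-anchored pattern still misses an `X-` header that is the very first line of the file, so a rule of this kind would normally sit alongside one that matches at offset 0.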
   






[GitHub] [tika] tballison commented on a change in pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox


tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762



##
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##
@@ -6422,7 +6422,7 @@
   
-  
+  

Review comment:
   I worry about looking for X- anywhere in the first 1024 without 
requiring a \n before it.
   
   What would you think of adding something like this into the previous 
minShouldMatch=2 clause?
   `
   
   
   
   `
   








[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962
 ] 

Thierry Guérin commented on TIKA-3687:
--

Created a pull request: [https://github.com/apache/tika/pull/520].

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. 
The other solution was to increase 1024 to at least 8000 (I have another email in 
which the first 'From:' is around offset 6400) in lines 6407-6420. I'm sure someone 
here has a good idea of which version is the most efficient.

 



[jira] [Updated] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thierry Guérin updated TIKA-3687:
-
Description: 
The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC * headers, but they're not the first headers, so 
the matcher that looks for ARC headers fails, and the matcher for the regular 
'From' header also fails because the 'From' header occurs after 1024 
characters.

  was:
The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC -* headers, but they're not the first headers, so 
the matcher that looks for ARC- headers fails, and the matcher for the regular 
'From' header also fails because the 'From' header occurs after 1024 
characters.




[jira] [Commented] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500953#comment-17500953
 ] 

ASF GitHub Bot commented on TIKA-3687:
--

SchwingSK opened a new pull request #520:
URL: https://github.com/apache/tika/pull/520


   1024 is maybe a bit overkill for the X|DKIM|ARC headers lookahead?




[GitHub] [tika] SchwingSK opened a new pull request #520: Fix email detection (TIKA-3687)

2022-03-03 Thread GitBox


SchwingSK opened a new pull request #520:
URL: https://github.com/apache/tika/pull/520


   1024 is maybe a bit overkill for the X|DKIM|ARC headers lookahead?






[jira] [Created] (TIKA-3687) Email file detected as text/html

2022-03-03 Thread Jira
Thierry Guérin created TIKA-3687:


 Summary: Email file detected as text/html
 Key: TIKA-3687
 URL: https://issues.apache.org/jira/browse/TIKA-3687
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Thierry Guérin
 Attachments: testRFC822-ARC.eml

The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC -* headers, but they're not the first headers, so 
the matcher that looks for ARC- headers fails, and the matcher for the regular 
'From' header also fails because the 'From' header occurs after 1024 
characters.
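
The failure mode described here can be sketched as follows. This is a simplified Python model, not Tika's detector: `match_at` and `match_within` are hypothetical helpers standing in for a magic rule anchored at one offset and a rule searched within a byte window, and the sample message is made up.

```python
def match_at(data: bytes, pattern: bytes, offset: int = 0) -> bool:
    """True if `pattern` appears exactly at `offset` (a rule anchored at one position)."""
    return data[offset:offset + len(pattern)] == pattern


def match_within(data: bytes, pattern: bytes, start: int, end: int) -> bool:
    """True if `pattern` begins at some position in [start, end]."""
    return data.find(pattern, start, end + len(pattern)) != -1


# Made-up message: a Received header first, then a long ARC header, then From.
message = (
    b"Received: from mail.example.org " + b"x" * 200 + b"\r\n"
    + b"ARC-Seal: i=1; " + b"y" * 1000 + b"\r\n"
    + b"From: sender@example.org\r\n"
    + b"\r\n<html><body>hello</body></html>"
)

arc_at_start = match_at(message, b"ARC-")                    # ARC is not the first header
from_in_window = match_within(message, b"\nFrom:", 0, 1024)  # From begins past 1024
arc_in_window = match_within(message, b"\nARC-", 0, 1024)    # a windowed ARC search succeeds

print(arc_at_start, from_in_window, arc_in_window)  # False False True
```

With both email rules failing, a detector that falls back on the HTML body would report text/html, which matches the behaviour reported above.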





[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500904#comment-17500904
 ] 

Tim Allison commented on TIKA-3668:
---

When I ran the same test against 1.26 with tesseract removed via tika config, 
it was a bit faster and used a bit less CPU.

{noformat}
~$ pidstat -p 261234 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

12:01:10 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
12:01:10 PM  1000    261234    0.14    0.00    0.00    0.00    0.14     5  java

12:01:10 PM   UID       PID    usr-ms system-ms  guest-ms  Command
12:01:10 PM  1000    261234    396490     12570         0  java
{noformat}
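
As a rough worked comparison, the cumulative CPU times above (usr-ms plus system-ms) can be set against the three dev/main runs reported in the earlier comment on this issue. A minimal Python sketch; the variable names are mine, and it assumes the 1.26 run and the dev/main runs processed the same files the same number of times, which the thread implies but does not state outright.

```python
# Cumulative CPU time in milliseconds as reported by pidstat: (usr-ms, system-ms).
run_1_26 = (396490, 12570)
runs_main = [(442080, 11820), (439390, 11780), (437250, 12380)]

total_1_26 = sum(run_1_26)
totals_main = [usr + system for usr, system in runs_main]
mean_main = sum(totals_main) / len(totals_main)

print(total_1_26)                        # 409060
print(totals_main)                       # [453900, 451170, 449630]
print(round(mean_main / total_1_26, 3))  # 1.104
```

Under that assumption the dev/main runs used roughly 10% more CPU time than 1.26 in this test.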

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
>  Issue Type: Bug
>Reporter: Manjunath Dhongadi
>Priority: Major
> Attachments: tika-config-no-tess.xml, tika-config.xml
>
>
> Recently we upgraded Tika from 1.26 to 2.2.0.
> CPU utilization has gone up drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available for this?





[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874
 ] 

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:46 PM:


Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested image rendering and extract inline images 
code in the PDFParser.  With debugging and custom logging, I could see that 
even running multi-threaded, the code works as expected.  If the header says 
no-ocr, pages aren't rendered in the PDFParser and inline images are not 
extracted.

2) In a single thread, I ran all the files in our unit tests with custom 
logging to detect if the TesseractOCRParser was being called on any of the file 
types when the header was set to no_ocr.  I couldn't find any problems.  The 
TesseractOCRParser was never called to parse.

3)  I ran pidstat with three settings against all of our test files 10 times. 
The client was single threaded.  I ran pidstat against the forked process, not 
the primary watcher process.  The results all basically look the same to me. 

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
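
To put a number on "basically look the same", the cumulative CPU times from the three runs above can be summed and compared. A minimal Python sketch using only the usr-ms and system-ms figures reported by pidstat; the dictionary keys are my own shorthand for the three settings.

```python
# (usr-ms, system-ms) for each setting, copied from the pidstat output above.
runs = {
    "config off, no header": (442080, 11820),
    "config off, header": (439390, 11780),
    "header only": (437250, 12380),
}

totals = {name: usr + system for name, (usr, system) in runs.items()}
spread = max(totals.values()) - min(totals.values())
spread_pct = round(100 * spread / min(totals.values()), 2)

print(totals)              # totals of 453900, 451170 and 449630 ms
print(spread, spread_pct)  # 4270 0.95
```

The three settings differ by under 1% in total CPU time, which is consistent with the conclusion that none of them triggers extra OCR work.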


was (Author: talli...@mitre.org):
Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested image rendering and extract inline images 
code in the PDFParser.  With debugging and custom logging, I could see that 
even running multi-threaded, the code works as expected.  If the header says 
no-ocr, pages aren't rendered in the PDFParser and inline images are not 
extracted.

2) In a single thread, I ran all the files in our unit tests with custom 
logging to detect if the TesseractOCRParser was being called on any of the file 
types when the header was set to no_ocr.  I couldn't find any problems.  The 
TesseractOCRParser was never called to parse.

3)  I ran pidstat with three settings; the client was single threaded.  I ran 
pidstat against the forked process, not the primary watcher process.  The 
results all basically look the same to me. 

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}

> High CPU utilization in Tika 2.2.0
> --
>
> 

[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500876#comment-17500876
 ] 

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:42 PM:


I just added the two config files I was using.  This was all with the dev/main 
branch, not with 2.2.0, but I don't think much (related to this) has changed.


was (Author: talli...@mitre.org):
I just added the two config files I was using.



[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500876#comment-17500876
 ] 

Tim Allison commented on TIKA-3668:
---

I just added the two config files I was using.

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
> Attachments: tika-config-no-tess.xml, tika-config.xml
>
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has increased drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available to address it?





[jira] [Updated] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3668:
--
Attachment: tika-config.xml
tika-config-no-tess.xml

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
> Attachments: tika-config-no-tess.xml, tika-config.xml
>
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has increased drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available to address it?





[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874
 ] 

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:40 PM:


Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested the image-rendering and inline-image 
extraction code in the PDFParser.  With debugging and custom logging, I could 
see that the code works as expected, even when running multi-threaded.  If the 
header says no-ocr, pages aren't rendered in the PDFParser and inline images 
are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom 
logging to detect if the TesseractOCRParser was being called on any of the file 
types when the header was set to no_ocr.  I couldn't find any problems.  The 
TesseractOCRParser was never called to parse.

3)  I ran pidstat with three settings; the client was single threaded.  I ran 
pidstat against the forked process, not the primary watcher process.  The 
results all basically look the same to me. 

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:31:47 AM  1000  254595   0.16     0.00    0.00   0.00   0.17    2  java

11:31:47 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:31:47 AM  1000  254595  442080      11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:08:39 AM  1000  250033   0.16     0.00    0.00   0.00   0.17    5  java

11:08:39 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:08:39 AM  1000  250033  439390      11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:16:50 AM  1000  252228   0.16     0.00    0.00   0.00   0.17    5  java

11:16:50 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:16:50 AM  1000  252228  437250      12380         0  java
{noformat}
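
To put the three runs side by side, here is a minimal sketch in Java that computes the relative spread of the cumulative usr-ms and system-ms values reported above. The class and method names are hypothetical, and the comparison is only meaningful under the assumption that each run processed the same workload for a similar duration, which the pidstat output itself does not confirm.

```java
import java.util.Arrays;

// Hypothetical helper: compares the cumulative CPU times (in ms) that
// pidstat reported for the three forked Tika processes above.
final class PidstatSpread {

    // Relative spread of a set of values: (max - min) / min.
    static double relativeSpread(long... values) {
        long min = Arrays.stream(values).min().getAsLong();
        long max = Arrays.stream(values).max().getAsLong();
        return (double) (max - min) / min;
    }

    public static void main(String[] args) {
        // Order: PID 254595, PID 250033, PID 252228 (values copied from the output above).
        long[] usrMs = {442080, 439390, 437250};
        long[] systemMs = {11820, 11780, 12380};

        System.out.printf("usr-ms spread:    %.2f%%%n", 100 * relativeSpread(usrMs));
        System.out.printf("system-ms spread: %.2f%%%n", 100 * relativeSpread(systemMs));
    }
}
```

Under that assumption, the spread comes out at about 1% for user time and about 5% for system time, which is consistent with the observation that the three settings look basically the same.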


was (Author: talli...@mitre.org):
Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested the image-rendering and inline-image 
extraction code in the PDFParser.  With debugging and custom logging, I could 
see that the code works as expected, even when running multi-threaded.  If the 
header says no-ocr, pages aren't rendered in the PDFParser and inline images 
are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom 
logging to detect if the TesseractOCRParser was being called on any of the file 
types when the header was set to no_ocr.  I couldn't find any problems.  The 
TesseractOCRParser was never called to parse.

3)  I ran pidstat with three settings; the client was single threaded.  The 
results all basically look the same to me.  The f

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:31:47 AM  1000  254595   0.16     0.00    0.00   0.00   0.17    2  java

11:31:47 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:31:47 AM  1000  254595  442080      11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:08:39 AM  1000  250033   0.16     0.00    0.00   0.00   0.17    5  java

11:08:39 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:08:39 AM  1000  250033  439390      11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:16:50 AM  1000  252228   0.16     0.00    0.00   0.00   0.17    5  java

11:16:50 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:16:50 AM  1000  252228  437250      12380         0  java
{noformat}

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
>   

[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500875#comment-17500875
 ] 

Tim Allison commented on TIKA-3668:
---

I'm not denying you're seeing what you're seeing.  I regret that if I can't 
reproduce it locally, I can't fix it.  Am I misunderstanding pidstat output?  
Is there a better way for me to try to reproduce this locally?

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has increased drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available to address it?





[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874
 ] 

Tim Allison commented on TIKA-3668:
---

Thank you.  I tried three things this morning.

1) Manually reviewed and re-tested the image-rendering and inline-image 
extraction code in the PDFParser.  With debugging and custom logging, I could 
see that the code works as expected, even when running multi-threaded.  If the 
header says no-ocr, pages aren't rendered in the PDFParser and inline images 
are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom 
logging to detect if the TesseractOCRParser was being called on any of the file 
types when the header was set to no_ocr.  I couldn't find any problems.  The 
TesseractOCRParser was never called to parse.

3)  I ran pidstat with three settings; the client was single threaded.  The 
results all basically look the same to me.  The f

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:31:47 AM  1000  254595   0.16     0.00    0.00   0.00   0.17    2  java

11:31:47 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:31:47 AM  1000  254595  442080      11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:08:39 AM  1000  250033   0.16     0.00    0.00   0.00   0.17    5  java

11:08:39 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:08:39 AM  1000  250033  439390      11780         0  java


disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID     PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
11:16:50 AM  1000  252228   0.16     0.00    0.00   0.00   0.17    5  java

11:16:50 AM   UID     PID  usr-ms  system-ms  guest-ms  Command
11:16:50 AM  1000  252228  437250      12380         0  java
{noformat}

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
>
> Recently we upgraded Tika from version 1.26 to 2.2.0.
> CPU utilization has increased drastically (6 to 8 times higher), both with 
> Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0? 
> Are any tuning parameters available to address it?





[jira] [Comment Edited] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Vincent Massol (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500814#comment-17500814
 ] 

Vincent Massol edited comment on TIKA-3686 at 3/3/22, 2:57 PM:
---

[~nick] Thanks. Note that it seems to be some sort of regression since it 
worked fine before the upgrade to Tika 2.0.0. Was this change of behavior 
wanted?

Details at https://jira.xwiki.org/browse/XWIKI-19491


was (Author: vmassol):
[~nick] Thanks. Note that it seems to be some sort of regression since it 
worked fine before the upgrade to Tika 2.0.0. Was this change of behavior 
wanted?

> CSS file detected as JavaScript (application/javascript)
> 
>
> Key: TIKA-3686
> URL: https://issues.apache.org/jira/browse/TIKA-3686
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.0.0-ALPHA
> Reporter: Marius Dumitru Florea
> Priority: Major
>
> The following CSS file 
> [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
>  is detected as {{application/javascript}} using:
> {noformat}
> TikaUtils.detect(InputStream stream, String name)
> {noformat}
> The reason seems to be that the CSS file starts with:
> {noformat}
> /*!
>  * jQuery
> {noformat}
> which matches the "jQuery" entry from 
> [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
>  used by Tika's {{MimeTypes}} detector.
> This is a regression introduced by 
> https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7
>  in TIKA-1141 (2.0.0-ALPHA).
> The implications are serious if the MIME type returned by Tika is used to set 
> the Content-Type of the HTTP response that returns the CSS file to the 
> browser: the browser ignores the CSS.
> FTR, in my case the CSS file is not served directly from the file system but 
> from a WebJar (in this case 
> https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
> we're using Tika to determine the type of files requested from the WebJars.





[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Vincent Massol (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500814#comment-17500814
 ] 

Vincent Massol commented on TIKA-3686:
--

[~nick] Thanks. Note that it seems to be some sort of regression since it 
worked fine before the upgrade to Tika 2.0.0. Was this change of behavior 
wanted?

> CSS file detected as JavaScript (application/javascript)
> 
>
> Key: TIKA-3686
> URL: https://issues.apache.org/jira/browse/TIKA-3686
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.0.0-ALPHA
> Reporter: Marius Dumitru Florea
> Priority: Major
>
> The following CSS file 
> [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
>  is detected as {{application/javascript}} using:
> {noformat}
> TikaUtils.detect(InputStream stream, String name)
> {noformat}
> The reason seems to be that the CSS file starts with:
> {noformat}
> /*!
>  * jQuery
> {noformat}
> which matches the "jQuery" entry from 
> [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
>  used by Tika's {{MimeTypes}} detector.
> This is a regression introduced by 
> https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7
>  in TIKA-1141 (2.0.0-ALPHA).
> The implications are serious if the MIME type returned by Tika is used to set 
> the Content-Type of the HTTP response that returns the CSS file to the 
> browser: the browser ignores the CSS.
> FTR, in my case the CSS file is not served directly from the file system but 
> from a WebJar (in this case 
> https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
> we're using Tika to determine the type of files requested from the WebJars.





[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500804#comment-17500804
 ] 

Nick Burch commented on TIKA-3686:
--

Detecting the types of text-based files with magic is always going to fail in 
some cases. There are no sure-fire things to match on, only guesses.

If you're sure that your files have the right extensions, just ask Tika 
to detect by filename only, without looking at the contents.
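
As an illustration of what detecting by filename only means, here is a minimal sketch in plain Java. It does not call Tika: the NameOnlyDetection class and its two-entry extension table are hypothetical stand-ins for Tika's much larger registry, and the fallback type is an assumption of the sketch. In Tika itself, the Tika facade's detect(String name) method is, to my knowledge, the name-only entry point.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for name-only detection; not Tika's implementation.
final class NameOnlyDetection {

    // Tiny illustrative extension table (an assumption of this sketch).
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "css", "text/css",
            "js", "application/javascript");

    // Looks only at the file name; the file contents are never read.
    static String detectByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream"; // no usable extension
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, "application/octet-stream");
    }

    public static void main(String[] args) {
        // The file from the report: its name ends in .css, so a leading
        // jQuery-style comment in the content cannot influence the result.
        System.out.println(detectByName("smart_wizard_all.min.css"));
    }
}
```

The trade-off is the one stated above: this is only safe when the file extensions can be trusted.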

> CSS file detected as JavaScript (application/javascript)
> 
>
> Key: TIKA-3686
> URL: https://issues.apache.org/jira/browse/TIKA-3686
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.0.0-ALPHA
> Reporter: Marius Dumitru Florea
> Priority: Major
>
> The following CSS file 
> [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
>  is detected as {{application/javascript}} using:
> {noformat}
> TikaUtils.detect(InputStream stream, String name)
> {noformat}
> The reason seems to be that the CSS file starts with:
> {noformat}
> /*!
>  * jQuery
> {noformat}
> which matches the "jQuery" entry from 
> [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
>  used by Tika's {{MimeTypes}} detector.
> This is a regression introduced by 
> https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7
>  in TIKA-1141 (2.0.0-ALPHA).
> The implications are serious if the MIME type returned by Tika is used to set 
> the Content-Type of the HTTP response that returns the CSS file to the 
> browser: the browser ignores the CSS.
> FTR, in my case the CSS file is not served directly from the file system but 
> from a WebJar (in this case 
> https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
> we're using Tika to determine the type of files requested from the WebJars.





[jira] [Commented] (TIKA-3682) PDFParser is extracting each char of a word in a new line

2022-03-03 Thread Sree Harsha (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500792#comment-17500792
 ] 

Sree Harsha commented on TIKA-3682:
---

Apologies for the delayed response.

I am creating a sample file and will upload it in a couple of days.

> PDFParser is extracting each char of a word in a new line
> -
>
> Key: TIKA-3682
> URL: https://issues.apache.org/jira/browse/TIKA-3682
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.26, 2.3.0
> Reporter: Sree Harsha
> Priority: Major
> Attachments: image-2022-02-22-13-14-14-067.png
>
>
> When the PDF parser extracts text from a PDF document that contains text in a 
> different orientation, each character of a word is extracted onto a new 
> line.
> For example, the text is extracted like this:
> TO
>  P
> LA
> C
> E
> A
> N
>  O
> R
> D
> E
> R
> where the original text looks like this: 
> !image-2022-02-22-13-14-14-067.png!
> setExtractBookmarksText(false);
> getPDFParserConfig().setEnableAutoSpace(true);
>  
> After adding the options below:
> setSortByPosition(true);
> setSuppressDuplicateOverlappingText(true);
> setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
>  
> the text is extracted like this:
> TO PLACE xxx
> yyy AN ORDER
>  
> where xxx and yyy refer to other text at the same level in the PDF document.
> If I search for TO PLACE AN ORDER in Acrobat Reader, it works, but if I search 
> for the same text in the extracted content, it does not.
> Is there an option to exclude the unnecessary newline characters shown in the 
> first example and also avoid the side effect of sorting by position?
> The output should look like:
> TO PLACE AN ORDER
> xx 





[jira] [Created] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Marius Dumitru Florea (Jira)
Marius Dumitru Florea created TIKA-3686:
---

 Summary: CSS file detected as JavaScript (application/javascript)
 Key: TIKA-3686
 URL: https://issues.apache.org/jira/browse/TIKA-3686
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 2.0.0-ALPHA
Reporter: Marius Dumitru Florea


The following CSS file 
[https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css]
 is detected as {{application/javascript}} using:

{noformat}
TikaUtils.detect(InputStream stream, String name)
{noformat}

The reason seems to be that the CSS file starts with:

{noformat}
/*!
 * jQuery
{noformat}

which matches the "jQuery" entry from 
[tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348]
 used by Tika's {{MimeTypes}} detector.

This is a regression introduced by 
https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7 
in TIKA-1141 (2.0.0-ALPHA).

The implications are serious if the MIME type returned by Tika is used to set 
the Content-Type of the HTTP response that returns the CSS file to the browser: 
the browser ignores the CSS.

FTR, in my case the CSS file is not served directly from the file system but 
from a WebJar (in this case 
https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and 
we're using Tika to determine the type of files requested from the WebJars.
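
The false positive described above can be illustrated with a toy prefix matcher in plain Java. This is only a sketch: the MagicFalsePositive class, its single pattern, and the text/plain fallback are hypothetical simplifications, not the actual entry in tika-mimetypes.xml or the logic of Tika's MimeTypes detector. The sample CSS content after the first two lines is also made up.

```java
import java.nio.charset.StandardCharsets;

// Toy model of content-based ("magic") detection; not Tika's implementation.
final class MagicFalsePositive {

    // Hypothetical stand-in for a jQuery magic pattern anchored at offset 0.
    private static final byte[] JQUERY_MAGIC =
            "/*!\n * jQuery".getBytes(StandardCharsets.US_ASCII);

    // True if the content begins with the given magic bytes.
    static boolean startsWith(byte[] content, byte[] magic) {
        if (content.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (content[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Looks only at the leading bytes, never at the file name.
    static String detectByContent(byte[] content) {
        return startsWith(content, JQUERY_MAGIC) ? "application/javascript" : "text/plain";
    }

    public static void main(String[] args) {
        // A CSS file whose banner comment happens to begin like a jQuery script.
        byte[] css = "/*!\n * jQuery plugin styles (made-up header)\n */\n.sw{color:red}"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectByContent(css)); // the banner wins over the CSS body
    }
}
```

Because the match looks only at the leading bytes, a stylesheet that carries a jQuery-style banner comment is indistinguishable from a script at this stage.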





[jira] [Comment Edited] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500173#comment-17500173
 ] 

Tim Allison edited comment on TIKA-3684 at 3/3/22, 12:42 PM:
-

I attached an example for turning off the WMFParser and the EMFParser. 


was (Author: talli...@mitre.org):
I attached an example for turning off the WMFParser and the EMFParser. When 
calling tika-server with docker, add {{--config tika-config-no-xmf.xml}}

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are running the Tika Docker container as a Linux service. When I extract 
> text from a Word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> the text is returned 3 times.
> Note: We also have Tika server v1.14, and that version returns the text 
> as expected.





[jira] [Comment Edited] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500173#comment-17500173
 ] 

Tim Allison edited comment on TIKA-3684 at 3/3/22, 12:41 PM:
-

I attached an example for turning off the WMFParser and the EMFParser. When 
calling tika-server with docker, add {{--config tika-config-no-xmf.xml}}


was (Author: talli...@mitre.org):
I attached an example for turning off the WMFParser and the EMFParser. When 
calling tika-server in docker, add {{-c tika-config-no-xmf.xml}}

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are running the Tika Docker container as a Linux service. When I extract 
> text from a Word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> the text is returned 3 times.
> Note: We also have Tika server v1.14, and that version returns the text 
> as expected.





[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500720#comment-17500720
 ] 

Tim Allison commented on TIKA-3684:
---

Oops. Thank you!

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are running the Tika Docker container as a Linux service. When I extract 
> text from a Word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> the text is returned 3 times.
> Note: We also have Tika server v1.14, and that version returns the text 
> as expected.





[GitHub] [tika] tballison merged pull request #519: Bump commons-net from 3.7.2 to 3.8.0

2022-03-03 Thread GitBox


tballison merged pull request #519:
URL: https://github.com/apache/tika/pull/519


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [tika] tballison merged pull request #517: Bump bndlib from 1.50.0 to 2.0.0.20130123-133441

2022-03-03 Thread GitBox


tballison merged pull request #517:
URL: https://github.com/apache/tika/pull/517


   






[GitHub] [tika] tballison merged pull request #518: Bump build-helper-maven-plugin from 3.0.0 to 3.3.0

2022-03-03 Thread GitBox


tballison merged pull request #518:
URL: https://github.com/apache/tika/pull/518


   

