[GitHub] [tika] dependabot[bot] opened a new pull request #522: Bump twelvemonkeys.version from 3.8.1 to 3.8.2
dependabot[bot] opened a new pull request #522:
URL: https://github.com/apache/tika/pull/522

Bumps `twelvemonkeys.version` from 3.8.1 to 3.8.2.

Updates `common-io` from 3.8.1 to 3.8.2
Updates `imageio-bmp` from 3.8.1 to 3.8.2
Updates `imageio-jpeg` from 3.8.1 to 3.8.2
Updates `imageio-psd` from 3.8.1 to 3.8.2
Updates `imageio-tiff` from 3.8.1 to 3.8.2

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [tika] dependabot[bot] opened a new pull request #521: Bump maven-javadoc-plugin from 3.3.1 to 3.3.2
dependabot[bot] opened a new pull request #521:
URL: https://github.com/apache/tika/pull/521

Bumps [maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) from 3.3.1 to 3.3.2.

Commits
- [50a41f7](https://github.com/apache/maven-javadoc-plugin/commit/50a41f78278c1d5957cb03e89223fcbc640f1b68) [maven-release-plugin] prepare release maven-javadoc-plugin-3.3.2
- [5af4519](https://github.com/apache/maven-javadoc-plugin/commit/5af4519ce17308f35d4cfb553d0af8168cac35f4) [MJAVADOC-705] Upgrade Maven Reporting API to 3.1.0
- [ee4132f](https://github.com/apache/maven-javadoc-plugin/commit/ee4132f9d0f6f412784e108305524cf3b3a3009a) [MJAVADOC-704] Javadoc plugin does not respect jdkToolchain
- [651b98e](https://github.com/apache/maven-javadoc-plugin/commit/651b98e6951ee2e3d8fefa1bcb3629f1dae763be) Bump doxia-site-renderer from 1.10 to 1.11.1
- [db20fdd](https://github.com/apache/maven-javadoc-plugin/commit/db20fddd4eb443711948645613c03a4ccc516dab) Bump plexus-archiver from 4.2.5 to 4.2.6
- [b51c5d8](https://github.com/apache/maven-javadoc-plugin/commit/b51c5d801c1e329b1e59e7b66142c8fd9c0ba6af) [MJAVADOC-694] Avoid empty warn message from getResolvePathResult
- [6b1515e](https://github.com/apache/maven-javadoc-plugin/commit/6b1515ed43826417cf4cbad88be1107e92cba3f5) Bump httpcore from 4.4.14 to 4.4.15
- [a4aa7dc](https://github.com/apache/maven-javadoc-plugin/commit/a4aa7dcdc22eda19a08cc3c3e231631d422f1725) (doc) fix javadoc issues
- [3309cc2](https://github.com/apache/maven-javadoc-plugin/commit/3309cc2ea03237576f3d702cffe1d628c3ea0d12) Bump maven-project-info-reports-plugin to 3.1.2
- [51a2c3f](https://github.com/apache/maven-javadoc-plugin/commit/51a2c3f803a0124c2591e46925210da705042f19) Bump maven-javadoc-plugin to 3.3.1
- Additional commits viewable in [compare view](https://github.com/apache/maven-javadoc-plugin/compare/maven-javadoc-plugin-3.3.1...maven-javadoc-plugin-3.3.2)

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=org.apache.maven.plugins:maven-javadoc-plugin&package-manager=maven&previous-version=3.3.1&new-version=3.3.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
[jira] [Commented] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501040#comment-17501040 ]

Hudson commented on TIKA-3687:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #475 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/475/])
[TIKA-3687] Fix email detection (#520) (github: [https://github.com/apache/tika/commit/5444f80d1b71845ff47c91376f5c90a40dae5a4f])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/resources/test-documents/testRFC822-ARC
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

> Email file detected as text/html
>
> Key: TIKA-3687
> URL: https://issues.apache.org/jira/browse/TIKA-3687
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: Thierry Guérin
> Priority: Minor
> Fix For: 2.3.1
> Attachments: testRFC822-ARC.eml
>
> The attached email (which I redacted from a real email received from Office365) is detected as HTML.
> This is because it contains ARC * headers, but they're not the first one, so the matcher that looks for ARC headers fails, and the matcher for the regular 'From' header also fails because the 'From' header occurs after 1024 characters.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500990#comment-17500990 ]

Tim Allison edited comment on TIKA-3687 at 3/3/22, 7:13 PM:

Thank you [~tguerin]!

was (Author: talli...@mitre.org):
Thank you!
[jira] [Resolved] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3687.
---
Fix Version/s: 2.3.1
Resolution: Fixed

Thank you!
[jira] [Commented] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500986#comment-17500986 ]

ASF GitHub Bot commented on TIKA-3687:
--

tballison merged pull request #520:
URL: https://github.com/apache/tika/pull/520
[GitHub] [tika] tballison merged pull request #520: Fix email detection (TIKA-3687)
tballison merged pull request #520:
URL: https://github.com/apache/tika/pull/520
[jira] [Commented] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500974#comment-17500974 ]

ASF GitHub Bot commented on TIKA-3687:
--

SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865

## File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml ##
@@ -6422,7 +6422,7 @@
- +

Review comment:
Good point, I like your solution better. Code changed accordingly.
[GitHub] [tika] SchwingSK commented on a change in pull request #520: Fix email detection (TIKA-3687)
SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865

## File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml ##
@@ -6422,7 +6422,7 @@
- +

Review comment:
Good point, I like your solution better. Code changed accordingly.
[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962 ]

Thierry Guérin edited comment on TIKA-3687 at 3/3/22, 6:22 PM:
---

Created a pull request: [https://github.com/apache/tika/pull/520].

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. The other solution was to increase 1024 to at least 8000 (I have another email in which the first 'From:' is at around character 6400) in lines 6407-6420. I'm sure someone here has a good idea of which version is the most efficient.

So far, I have only found examples with a single 'Received:' header before the 'ARC*' headers, which is why I think that 1024 may be overkill.

was (Author: tguerin):
Created a pull request: [https://github.com/apache/tika/pull/520].

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. The other solution was to increase 1024 to at least 8000 (I have another email in which the first 'From:' is at around character 6400) in lines 6407-6420. I'm sure someone here has a good idea of which version is the most efficient.
[jira] [Commented] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500966#comment-17500966 ]

ASF GitHub Bot commented on TIKA-3687:
--

tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762

## File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml ##
@@ -6422,7 +6422,7 @@
- +

Review comment:
I worry about looking for X- anywhere in the first 1024 without requiring a \n before it. What would you think of adding something like this into the previous minShouldMatch=2 clause? ` `
[GitHub] [tika] tballison commented on a change in pull request #520: Fix email detection (TIKA-3687)
tballison commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818943762

## File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml ##
@@ -6422,7 +6422,7 @@
- +

Review comment:
I worry about looking for X- anywhere in the first 1024 without requiring a \n before it. What would you think of adding something like this into the previous minShouldMatch=2 clause? ` `
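The concern raised in this review comment can be sketched as follows. This is a simplified stand-in, not Tika's matcher: the helper names and both sample inputs are hypothetical, and only the 1024 look-ahead and the `X-` prefix come from the discussion. It contrasts searching for `X-` anywhere in the first 1024 bytes with requiring a newline directly before it.

```python
WINDOW = 1024  # look-ahead size discussed in the pull request


def has_unanchored(data: bytes, token: bytes) -> bool:
    """Token may appear anywhere in the first WINDOW bytes."""
    return token in data[:WINDOW]


def has_line_anchored(data: bytes, token: bytes) -> bool:
    """Token must directly follow a newline within the first WINDOW bytes."""
    return (b"\n" + token) in data[:WINDOW]


# Hypothetical inputs: an ordinary HTML page and the start of an email.
html = b'<html><head><meta http-equiv="X-UA-Compatible" content="IE=edge"></head></html>'
email = b"Received: from a.example by b.example\r\nX-Mailer: example\r\nFrom: a@example.org\r\n"

print(has_unanchored(html, b"X-"))      # True: plain HTML would look like email
print(has_line_anchored(html, b"X-"))   # False
print(has_line_anchored(email, b"X-"))  # True
```

Requiring the newline keeps the wider look-ahead while limiting matches to text that begins a line, which is where a header name would appear.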
[jira] [Commented] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962 ]

Thierry Guérin commented on TIKA-3687:
--

Created a pull request: [https://github.com/apache/tika/pull/520].

I went with changing the X|DKIM|ARC headers look-ahead from 0 to 1024. The other solution was to increase 1024 to at least 8000 (I have another email in which the first 'From:' is at around character 6400) in lines 6407-6420. I'm sure someone here has a good idea of which version is the most efficient.
[jira] [Updated] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thierry Guérin updated TIKA-3687:
-
Description:
The attached email (which I redacted from a real email received from Office365) is detected as HTML.
This is because it contains ARC * headers, but they're not the first one, so the matcher that looks for ARC headers fails, and the matcher for the regular 'From' header also fails because the 'From' header occurs after 1024 characters.

was:
The attached email (which I redacted from a real email received from Office365) is detected as HTML.
This is because it contains ARC -* headers, but they're not the first one, so the matcher that looks for ARC- headers fails, and the matcher for the regular 'From' header also fails because the 'From' header occurs after 1024 characters.
[jira] [Commented] (TIKA-3687) Email file detected as text/html
[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500953#comment-17500953 ]

ASF GitHub Bot commented on TIKA-3687:
--

SchwingSK opened a new pull request #520:
URL: https://github.com/apache/tika/pull/520

1024 is maybe a bit overkill for the X|DKIM|ARC headers lookahead?
[GitHub] [tika] SchwingSK opened a new pull request #520: Fix email detection (TIKA-3687)
SchwingSK opened a new pull request #520:
URL: https://github.com/apache/tika/pull/520

1024 is maybe a bit overkill for the X|DKIM|ARC headers lookahead?
[jira] [Created] (TIKA-3687) Email file detected as text/html
Thierry Guérin created TIKA-3687:

Summary: Email file detected as text/html
Key: TIKA-3687
URL: https://issues.apache.org/jira/browse/TIKA-3687
Project: Tika
Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Thierry Guérin
Attachments: testRFC822-ARC.eml

The attached email (which I redacted from a real email received from Office365) is detected as HTML.
This is because it contains ARC -* headers, but they're not the first one, so the matcher that looks for ARC- headers fails, and the matcher for the regular 'From' header also fails because the 'From' header occurs after 1024 characters.
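The failure mode described in this report can be illustrated with a small sketch. It is not Tika's actual magic matcher: `matches_within` and the sample message are hypothetical, and the 1024-character window is the only detail taken from the report. The point is that a pattern searched for only inside a bounded prefix is missed once long headers push it past that prefix.

```python
def matches_within(data: bytes, pattern: bytes, window: int) -> bool:
    """Return True if `pattern` starts somewhere in the first `window` bytes.

    Hypothetical helper for illustration only; Tika's real detection is
    driven by tika-mimetypes.xml and is more involved than this.
    """
    # A match must start inside the window, so search a slice long enough
    # to hold a match that begins at the last allowed offset.
    return pattern in data[: window + len(pattern) - 1]


# Made-up message: one Received header, a long ARC-* header, then From.
received = b"Received: from mail.example.org by mx.example.com; Thu, 3 Mar 2022\r\n"
arc_headers = b"ARC-Seal: i=1; a=rsa-sha256; b=" + b"A" * 1200 + b"\r\n"
message = received + arc_headers + b"From: sender@example.org\r\n\r\n<html></html>"

print(message.startswith(b"ARC-"))                # False: ARC-* is not the first header
print(matches_within(message, b"\nFrom:", 1024))  # False: From: lies beyond the window
print(matches_within(message, b"\nARC-", 1024))   # True: a look-ahead finds ARC-*
```

With neither email signal firing, detection would fall through to whatever else matches, which in the reported case was HTML.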
[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500904#comment-17500904 ]

Tim Allison commented on TIKA-3668:
---

When I ran the same test against 1.26 with tesseract removed via tika config, it was a bit faster and a bit better on CPU.

{noformat}
~$ pidstat -p 261234 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

12:01:10 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
12:01:10 PM  1000    261234    0.14    0.00    0.00    0.00    0.14     5  java

12:01:10 PM   UID       PID    usr-ms system-ms  guest-ms  Command
12:01:10 PM  1000    261234    396490     12570         0  java
{noformat}

> High CPU utilization in Tika 2.2.0
> --
>
> Key: TIKA-3668
> URL: https://issues.apache.org/jira/browse/TIKA-3668
> Project: Tika
> Issue Type: Bug
> Reporter: Manjunath Dhongadi
> Priority: Major
> Attachments: tika-config-no-tess.xml, tika-config.xml
>
> Recently we upgraded Tika from 1.26 to 2.2.0.
> We see that CPU utilization has gone up drastically (6 to 8 times more), both with Tesseract enabled and with Tesseract disabled.
> We are using tika-parsers-standard-package 2.2.0.
> Is this normal behavior for Tika 2.2.0?
> Are any tuning parameters available for this?
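The pidstat totals reported in this thread can be compared with a little arithmetic. The sketch below only reuses the `usr-ms` and `system-ms` figures quoted in these comments; the dictionary labels and the helper name are made up for illustration, and a single run per configuration is not a benchmark.

```python
# (usr-ms, system-ms) as reported by pidstat in this thread.
runs = {
    "main, OCR parser disabled in tika-config, no header": (442080, 11820),
    "main, OCR parser disabled in tika-config, with header": (439390, 11780),
    "main, OCR disabled via header only": (437250, 12380),
    "1.26, tesseract removed via tika-config": (396490, 12570),
}


def total_cpu_ms(usr_ms: int, system_ms: int) -> int:
    """Total CPU time consumed by the process, in milliseconds."""
    return usr_ms + system_ms


baseline = total_cpu_ms(*runs["1.26, tesseract removed via tika-config"])
for label, (usr_ms, system_ms) in runs.items():
    total = total_cpu_ms(usr_ms, system_ms)
    print(f"{label}: {total} ms, {total / baseline:.2f}x the 1.26 run")
```

On these figures the main-branch runs use roughly 10% more CPU time than the 1.26 run, well short of the 6 to 8 times described in the issue.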
[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874 ]

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:46 PM:

Thank you. I tried three things this morning.

1) Manually reviewed and re-tested image rendering and extract inline images code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings against all of our test files 10 times. The client was single threaded. I ran pidstat against the forked process, not the primary watcher process. The results all basically look the same to me.

{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"

~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"

~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)

$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_  (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}

was (Author: talli...@mitre.org):
Thank you. I tried three things this morning.

1) Manually reviewed and re-tested image rendering and extract inline images code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted.

2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse.

3) I ran pidstat with three settings; the client was single threaded. I ran pidstat against the forked process, not the primary watcher process. The results all basically look the same to me.
[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500876#comment-17500876 ]

Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:42 PM:

I just added the two config files I was using. This was all with the dev/main branch, not with 2.2.0, but I don't think much (related to this) has changed.

was (Author: talli...@mitre.org):
I just added the two config files I was using.
[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500876#comment-17500876 ]

Tim Allison commented on TIKA-3668:
---

I just added the two config files I was using.
[jira] [Updated] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3668:
--
Attachment: tika-config.xml
            tika-config-no-tess.xml
[jira] [Comment Edited] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874 ] Tim Allison edited comment on TIKA-3668 at 3/3/22, 4:40 PM: Thank you. I tried three things this morning. 1) Manually reviewed and re-tested image rendering and extract inline images code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted. 2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse. 3) I ran pidstat with three settings; the client was single threaded. I ran pidstat against the forked process, not the primary watcher process. The results all basically look the same to me. 
{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"
~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
was (Author: talli...@mitre.org): Thank you. I tried three things this morning. 1) Manually reviewed and re-tested image rendering and extract inline images code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted. 2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse. 3) I ran pidstat with three settings; the client was single threaded. The results all basically look the same to me. 
The f
{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"
~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
> High CPU utilization in Tika 2.2.0 > -- > > Key: TIKA-3668 > URL: https://issues.apache.org/jira/browse/TIKA-3668 >
[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500875#comment-17500875 ] Tim Allison commented on TIKA-3668: --- I'm not denying you're seeing what you're seeing. I regret that if I can't reproduce it locally, I can't fix it. Am I misunderstanding pidstat output? Is there a better way for me to try to reproduce this locally? > High CPU utilization in Tika 2.2.0 > -- > > Key: TIKA-3668 > URL: https://issues.apache.org/jira/browse/TIKA-3668 > Project: Tika > Issue Type: Bug >Reporter: Manjunath Dhongadi >Priority: Major > > Recently we upgraded Tika version from 1.26 to 2.2.0. > We see the CPU utilization have gone high drastically(6 to 8 times more) in > both cases Tesseract enabled and Tesseract disabled case. > We are using tika-parsers-standard-package of 2.2.0. > Whether this is normal behavior of high version of Tika 2.2.0. > Any fine tuning parameters available for same. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3668) High CPU utilization in Tika 2.2.0
[ https://issues.apache.org/jira/browse/TIKA-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500874#comment-17500874 ] Tim Allison commented on TIKA-3668: --- Thank you. I tried three things this morning. 1) Manually reviewed and re-tested image rendering and extract inline images code in the PDFParser. With debugging and custom logging, I could see that even running multi-threaded, the code works as expected. If the header says no-ocr, pages aren't rendered in the PDFParser and inline images are not extracted. 2) In a single thread, I ran all the files in our unit tests with custom logging to detect if the TesseractOCRParser was being called on any of the file types when the header was set to no_ocr. I couldn't find any problems. The TesseractOCRParser was never called to parse. 3) I ran pidstat with three settings; the client was single threaded. The results all basically look the same to me. The f
{noformat}
disable ocr parser via tika-config and do not include "no-ocr header"
~$ pidstat -p 254595 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:31:47 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:47 AM  1000    254595    0.16    0.00    0.00    0.00    0.17     2  java

11:31:47 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:31:47 AM  1000    254595    442080     11820         0  java

disable ocr parser via tika-config and include "no-ocr header"
~$ pidstat -p 250033 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:08:39 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:39 AM  1000    250033    0.16    0.00    0.00    0.00    0.17     5  java

11:08:39 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:08:39 AM  1000    250033    439390     11780         0  java

disable ocr via header (do not disable tesseract via tika config)
$ pidstat -p 252228 -u -T ALL
Linux 5.13.0-30-generic ()  03/03/2022  _x86_64_ (8 CPU)

11:16:50 AM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:50 AM  1000    252228    0.16    0.00    0.00    0.00    0.17     5  java

11:16:50 AM   UID       PID    usr-ms system-ms  guest-ms  Command
11:16:50 AM  1000    252228    437250     12380         0  java
{noformat}
> High CPU utilization in Tika 2.2.0 > -- > > Key: TIKA-3668 > URL: https://issues.apache.org/jira/browse/TIKA-3668 > Project: Tika > Issue Type: Bug >Reporter: Manjunath Dhongadi >Priority: Major > > Recently we upgraded Tika version from 1.26 to 2.2.0. > We see the CPU utilization have gone high drastically(6 to 8 times more) in > both cases Tesseract enabled and Tesseract disabled case. > We are using tika-parsers-standard-package of 2.2.0. > Whether this is normal behavior of high version of Tika 2.2.0. > Any fine tuning parameters available for same. -- This message was sent by Atlassian Jira (v8.20.1#820001)
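To put a number on "the results all basically look the same", the three pidstat runs quoted above can be compared by total CPU time. The following is only a sketch: it assumes the `usr-ms` and `system-ms` columns are cumulative CPU milliseconds for the forked process, and the class name, method names, and run labels are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PidstatTotals {

    // Total CPU time in milliseconds: user time plus system time.
    static long totalCpuMs(long usrMs, long systemMs) {
        return usrMs + systemMs;
    }

    // Gap between the largest and smallest total, as a percentage of the largest.
    static double spreadPercent(long... totals) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long t : totals) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return 100.0 * (max - min) / max;
    }

    public static void main(String[] args) {
        // {usr-ms, system-ms} copied from the three pidstat runs in the comment.
        Map<String, long[]> runs = new LinkedHashMap<>();
        runs.put("config disables OCR, no header", new long[] {442080, 11820});
        runs.put("config disables OCR, with header", new long[] {439390, 11780});
        runs.put("header only", new long[] {437250, 12380});

        long[] totals = new long[runs.size()];
        int i = 0;
        for (Map.Entry<String, long[]> run : runs.entrySet()) {
            long total = totalCpuMs(run.getValue()[0], run.getValue()[1]);
            totals[i++] = total;
            System.out.println(run.getKey() + ": " + total + " ms");
        }
        System.out.printf("spread: %.2f%%%n", spreadPercent(totals));
    }
}
```

Under that assumption the three totals differ by less than one percent, which is nowhere near the 6 to 8 times increase described in the report.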
[jira] [Comment Edited] (TIKA-3686) CSS file detected as JavaScript (application/javascript)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500814#comment-17500814 ] Vincent Massol edited comment on TIKA-3686 at 3/3/22, 2:57 PM: --- [~nick] Thanks. Note that it seems to be some sort of regression since it worked fine before the upgrade to Tika 2.0.0. Was this change of behavior wanted? Details at https://jira.xwiki.org/browse/XWIKI-19491 was (Author: vmassol): [~nick] Thanks. Note that it seems to be some sort of regression since it worked fine before the upgrade to Tika 2.0.0. Was this change of behavior wanted? > CSS file detected as JavaScript (application/javascript) > > > Key: TIKA-3686 > URL: https://issues.apache.org/jira/browse/TIKA-3686 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.0.0-ALPHA >Reporter: Marius Dumitru Florea >Priority: Major > > The following CSS file > [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css] > is detected as {{application/javascript}} using: > {noformat} > TikaUtils.detect(InputStream stream, String name) > {noformat} > The reason seems to be that the CSS file starts with: > {noformat} > /*! > * jQuery > {noformat} > which matches the "jQuery" entry from > [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348] > used by Tika's {{MimeTypes}} detector. > This is a regression introduced by > https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7 > in TIKA-1141 (2.0.0-ALPHA). > The implications are serious if the mime type returned by Tika is used to set > the content type on the HTTP request returning the CSS file to the browser: > the browser ignores the CSS. 
> FTR, in my case the CSS file is not served directly from the file system but > from a WebJar (in this case > https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and > we're using Tika to determine the type of files requested from the WebJars. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500814#comment-17500814 ] Vincent Massol commented on TIKA-3686: -- [~nick] Thanks. Note that it seems to be some sort of regression since it worked fine before the upgrade to Tika 2.0.0. Was this change of behavior wanted? > CSS file detected as JavaScript (application/javascript) > > > Key: TIKA-3686 > URL: https://issues.apache.org/jira/browse/TIKA-3686 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.0.0-ALPHA >Reporter: Marius Dumitru Florea >Priority: Major > > The following CSS file > [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css] > is detected as {{application/javascript}} using: > {noformat} > TikaUtils.detect(InputStream stream, String name) > {noformat} > The reason seems to be that the CSS file starts with: > {noformat} > /*! > * jQuery > {noformat} > which matches the "jQuery" entry from > [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348] > used by Tika's {{MimeTypes}} detector. > This is a regression introduced by > https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7 > in TIKA-1141 (2.0.0-ALPHA). > The implications are serious if the mime type returned by Tika is used to set > the content type on the HTTP request returning the CSS file to the browser: > the browser ignores the CSS. > FTR, in my case the CSS file is not served directly from the file system but > from a WebJar (in this case > https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and > we're using Tika to determine the type of files requested from the WebJars. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500804#comment-17500804 ] Nick Burch commented on TIKA-3686: -- Detecting types of text-based files with magic is always going to fail for some cases. There are no sure-fire things to match on, only guesses If you're sure that your files have the right extensions on them, just ask Tika to detect by filename only, no contents > CSS file detected as JavaScript (application/javascript) > > > Key: TIKA-3686 > URL: https://issues.apache.org/jira/browse/TIKA-3686 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.0.0-ALPHA >Reporter: Marius Dumitru Florea >Priority: Major > > The following CSS file > [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css] > is detected as {{application/javascript}} using: > {noformat} > TikaUtils.detect(InputStream stream, String name) > {noformat} > The reason seems to be that the CSS file starts with: > {noformat} > /*! > * jQuery > {noformat} > which matches the "jQuery" entry from > [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348] > used by Tika's {{MimeTypes}} detector. > This is a regression introduced by > https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7 > in TIKA-1141 (2.0.0-ALPHA). > The implications are serious if the mime type returned by Tika is used to set > the content type on the HTTP request returning the CSS file to the browser: > the browser ignores the CSS. > FTR, in my case the CSS file is not served directly from the file system but > from a WebJar (in this case > https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and > we're using Tika to determine the type of files requested from the WebJars. -- This message was sent by Atlassian Jira (v8.20.1#820001)
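The trade-off described in this comment can be illustrated with a small standalone sketch. It does not use Tika's code or its mime database: `detectByContent` is a hypothetical stand-in for a magic rule that looks for "jQuery" in a leading comment, `detectByName` is a hypothetical extension lookup, and the sample bytes are invented apart from the `/*!` and `* jQuery` opening quoted in the report.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class DetectionSketch {

    // Hypothetical content rule: a leading block comment mentioning jQuery
    // is guessed to be JavaScript. Real magic rules are more involved.
    static String detectByContent(byte[] head) {
        String text = new String(head, StandardCharsets.UTF_8);
        if (text.startsWith("/*") && text.contains("jQuery")) {
            return "application/javascript";
        }
        return "text/plain";
    }

    // Hypothetical name rule: trust the file extension and ignore the bytes.
    static String detectByName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".css")) {
            return "text/css";
        }
        if (lower.endsWith(".js")) {
            return "application/javascript";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // Illustrative content only; the real file is much longer.
        byte[] head = "/*!\n * jQuery\n */\nbody{}".getBytes(StandardCharsets.UTF_8);
        String name = "smart_wizard_all.min.css";

        System.out.println("by content: " + detectByContent(head));
        System.out.println("by name:    " + detectByName(name));
    }
}
```

With Tika itself, name-only detection should be possible through the `detect(String name)` overload on the `Tika` facade, which never reads the stream; it is worth checking the Javadoc for the version in use.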
[jira] [Commented] (TIKA-3682) PDFParser is extracting each char of a word in a new line
[ https://issues.apache.org/jira/browse/TIKA-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500792#comment-17500792 ] Sree Harsha commented on TIKA-3682: --- Apologies for delayed response.. Creating a sample file and will upload in couple of days... > PDFParser is extracting each char of a word in a new line > - > > Key: TIKA-3682 > URL: https://issues.apache.org/jira/browse/TIKA-3682 > Project: Tika > Issue Type: Bug >Affects Versions: 1.26, 2.3.0 >Reporter: Sree Harsha >Priority: Major > Attachments: image-2022-02-22-13-14-14-067.png > > > when pdf parser is trying to extract text from a pdf document having a > different orientation for text, each character of word is extracted to a new > line. > For eg the text is extracted like below: > TO > P > LA > C > E > A > N > O > R > D > E > R > where the original text is like > !image-2022-02-22-13-14-14-067.png! > setExtractBookmarksText(false); > getPDFParserConfig().setEnableAutoSpace(true); > > After adding the below options: > setSortByPosition(true); > setSuppressDuplicateOverlappingText(true); > setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION); > > The text is extracted like: > TO PLACE xxx > yyy AN ORDER > > where xx, yyy refers to some other text at same level in pdf document. > If i search for TO PLACE AN ORDER in acrobat reader it works but if i search > for the same text in extracted text content, it won't work.. > Is there any option to exclude unnecessary new line characters shown in first > example and also solve the side effect or sort by position issue.. > The the output should look like: > TO PLACE AN ORDER > xx -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (TIKA-3686) CSS file detected as JavaScript (application/javascript)
Marius Dumitru Florea created TIKA-3686: --- Summary: CSS file detected as JavaScript (application/javascript) Key: TIKA-3686 URL: https://issues.apache.org/jira/browse/TIKA-3686 Project: Tika Issue Type: Bug Components: detector Affects Versions: 2.0.0-ALPHA Reporter: Marius Dumitru Florea The following CSS file [https://github.com/techlab/jquery-smartwizard/blob/v5.1.1/dist/css/smart_wizard_all.min.css] is detected as {{application/javascript}} using: {noformat} TikaUtils.detect(InputStream stream, String name) {noformat} The reason seems to be that the CSS file starts with: {noformat} /*! * jQuery {noformat} which matches the "jQuery" entry from [tika-mimetypes.xml|https://github.com/apache/tika/blob/2.3.0/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L348] used by Tika's {{MimeTypes}} detector. This is a regression introduced by https://github.com/apache/tika/commit/97699598f000139b1222b785d634b3c8a8e216c7 in TIKA-1141 (2.0.0-ALPHA). The implications are serious if the mime type returned by Tika is used to set the content type on the HTTP request returning the CSS file to the browser: the browser ignores the CSS. FTR, in my case the CSS file is not served directly from the file system but from a WebJar (in this case https://search.maven.org/artifact/org.webjars.npm/smartwizard/5.1.1/jar ) and we're using Tika to determine the type of files requested from the WebJars. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500173#comment-17500173 ] Tim Allison edited comment on TIKA-3684 at 3/3/22, 12:42 PM: - I attached an example for turning off the WMFParser and the EMFParser. was (Author: talli...@mitre.org): I attached an example for turning off the WMFParser and the EMFParser. When calling tika-server with docker, add {{--config tika-config-no-xmf.xml}} > Extract text returns the text multiple times > > > Key: TIKA-3684 > URL: https://issues.apache.org/jira/browse/TIKA-3684 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.1.0 >Reporter: Naama Hophstatder >Priority: Major > Attachments: example.docx, example.json, tika-config-no-xmf.xml > > > We are using tika docker container as a linux service, when I want to extract > text from a word document, e.g.: > curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain" > we get the text 3 times. > Notice: We also have tika server v1.14, and this version returns the text > just as expected. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500173#comment-17500173 ] Tim Allison edited comment on TIKA-3684 at 3/3/22, 12:41 PM: - I attached an example for turning off the WMFParser and the EMFParser. When calling tika-server with docker, add {{--config tika-config-no-xmf.xml}} was (Author: talli...@mitre.org): I attached an example for turning off the WMFParser and the EMFParser. When calling tika-server in docker, add {{-c tika-config-no-xmf.xml}} > Extract text returns the text multiple times > > > Key: TIKA-3684 > URL: https://issues.apache.org/jira/browse/TIKA-3684 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.1.0 >Reporter: Naama Hophstatder >Priority: Major > Attachments: example.docx, example.json, tika-config-no-xmf.xml > > > We are using tika docker container as a linux service, when I want to extract > text from a word document, e.g.: > curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain" > we get the text 3 times. > Notice: We also have tika server v1.14, and this version returns the text > just as expected. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500720#comment-17500720 ] Tim Allison commented on TIKA-3684: --- Oops. Thank you! > Extract text returns the text multiple times > > > Key: TIKA-3684 > URL: https://issues.apache.org/jira/browse/TIKA-3684 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.1.0 >Reporter: Naama Hophstatder >Priority: Major > Attachments: example.docx, example.json, tika-config-no-xmf.xml > > > We are using tika docker container as a linux service, when I want to extract > text from a word document, e.g.: > curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain" > we get the text 3 times. > Notice: We also have tika server v1.14, and this version returns the text > just as expected. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [tika] tballison merged pull request #519: Bump commons-net from 3.7.2 to 3.8.0
tballison merged pull request #519: URL: https://github.com/apache/tika/pull/519 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [tika] tballison merged pull request #517: Bump bndlib from 1.50.0 to 2.0.0.20130123-133441
tballison merged pull request #517: URL: https://github.com/apache/tika/pull/517
[GitHub] [tika] tballison merged pull request #518: Bump build-helper-maven-plugin from 3.0.0 to 3.3.0
tballison merged pull request #518: URL: https://github.com/apache/tika/pull/518