[GitHub] [tika] dependabot[bot] opened a new pull request #527: Bump dl4j.version from 1.0.0-M1.1 to 1.0.0-M2
dependabot[bot] opened a new pull request #527: URL: https://github.com/apache/tika/pull/527 Bumps `dl4j.version` from 1.0.0-M1.1 to 1.0.0-M2. Updates `datavec-data-image` from 1.0.0-M1.1 to 1.0.0-M2 Updates `deeplearning4j-zoo` from 1.0.0-M1.1 to 1.0.0-M2 Updates `deeplearning4j-modelimport` from 1.0.0-M1.1 to 1.0.0-M2 Updates `deeplearning4j-nn` from 1.0.0-M1.1 to 1.0.0-M2 Updates `nd4j-api` from 1.0.0-M1.1 to 1.0.0-M2 Updates `nd4j-native-platform` from 1.0.0-M1.1 to 1.0.0-M2 Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504036#comment-17504036 ] Naama Hophstatder commented on TIKA-3684: - I don't know how I should configure the service, since I'm running it locally, not in a Docker container. The docs only cover 2.0, so can you help us configure a local 1.24 tika-server as a Linux service? > Extract text returns the text multiple times > > > Key: TIKA-3684 > URL: https://issues.apache.org/jira/browse/TIKA-3684 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.1.0 >Reporter: Naama Hophstatder >Priority: Major > Attachments: example.docx, example.json, tika-config-no-xmf.xml > > > We are using tika docker container as a linux service, when I want to extract > text from a word document, e.g.: > curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain" > we get the text 3 times. > Notice: We also have tika server v1.14, and this version returns the text > just as expected. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503774#comment-17503774 ] Hudson commented on TIKA-3694: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #483 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/483/]) TIKA-3694 - trivial checkstyle fixes (tallison: [https://github.com/apache/tika/commit/43b1284cf74467a94e3064108ab661fdcfa4d8a0]) * (edit) tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaMimeTypesTest.java * (edit) tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaMimeTypes.java > Tika Server endpoint to return more details on a mime type > -- > > Key: TIKA-3694 > URL: https://issues.apache.org/jira/browse/TIKA-3694 > Project: Tika > Issue Type: Improvement > Components: mime, server >Affects Versions: 2.3.0 >Reporter: Nick Burch >Assignee: Tim Allison >Priority: Major > Fix For: 2.3.1 > > > As raised on the user list - > [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users > calling the Java APIs are able to get additional details on a mime type, such > as common extensions and descriptions. Those calling the Tika Server can only > get limited information on mime types, such as which are known to Tika > In addition to the current {{/mime-types}} endpoint (html/json/text), we > should add a more detailed one that takes a specific type. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (TIKA-3697) Add parser for warc files
Tim Allison created TIKA-3697: - Summary: Add parser for warc files Key: TIKA-3697 URL: https://issues.apache.org/jira/browse/TIKA-3697 Project: Tika Issue Type: Task Reporter: Tim Allison netpreserve's jwarc is ASL 2.0, fairly small and no dependencies. Should we add this into tika-parsers-standard or create a separate package for it in tika-parsers-extended?
[jira] [Created] (TIKA-3696) Add detection for wacz files
Tim Allison created TIKA-3696: - Summary: Add detection for wacz files Key: TIKA-3696 URL: https://issues.apache.org/jira/browse/TIKA-3696 Project: Tika Issue Type: Task Reporter: Tim Allison https://webrecorder.github.io/wacz-spec/1.2.0/ Zip file with standard entries: 'archive', 'datapackage.json', 'indexes' and 'pages'.
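The detection idea in TIKA-3696 could be sketched roughly as follows: walk the zip entries and check that the four top-level names from the issue are present. This is only an illustration using the JDK's java.util.zip classes. `WaczSniffer`, `looksLikeWacz` and `zipOf` are hypothetical names, not Tika classes or methods, and treating all four entries as required is an assumption rather than something the issue states.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not a Tika class.
final class WaczSniffer {

    // Top-level names listed in the issue. Requiring all four is an assumption.
    private static final Set<String> EXPECTED = new HashSet<>(
            Arrays.asList("archive", "datapackage.json", "indexes", "pages"));

    // Returns true once every expected top-level name has been seen in the zip.
    static boolean looksLikeWacz(InputStream in) throws IOException {
        Set<String> missing = new HashSet<>(EXPECTED);
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while (!missing.isEmpty() && (entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                int slash = name.indexOf('/');
                // "archive/data.warc.gz" counts as the top-level name "archive".
                missing.remove(slash < 0 ? name : name.substring(0, slash));
            }
        }
        return missing.isEmpty();
    }

    // Convenience overload for in-memory data; wraps I/O errors as unchecked.
    static boolean looksLikeWacz(byte[] data) {
        try {
            return looksLikeWacz(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a small in-memory zip with the given entry names, for trying the check.
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(0); // one placeholder byte per entry
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A real detector would probably avoid reading the whole stream and might also inspect datapackage.json; this sketch only checks entry names.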
[jira] [Commented] (TIKA-3695) LimitingMetadataFilter
[ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503723#comment-17503723 ] Tim Allison commented on TIKA-3695: --- Maybe extend Metadata as SecureMetadata and put the logic there with configuration via AutoDetectParser? That won't work easily because parsers create new Metadata objects for embedded files. Do we need to pass around a MetadataFactory or MetadataWriter? That'll still require a bunch of changes, but would be cleaner? > LimitingMetadataFilter > -- > > Key: TIKA-3695 > URL: https://issues.apache.org/jira/browse/TIKA-3695 > Project: Tika > Issue Type: New Feature > Components: metadata >Affects Versions: 1.28.1, 2.3.0 >Reporter: Julien Massiera >Priority: Major > > Some files may contain abnormally big metadata (several MB, be it for the > metadata values, the metadata names, but also for the total amount of > metadata) that can be problematic concerning the memory consumption. > It would be great to develop a new LimitingMetadataFilter so that we can > filter out the metadata according to different bytes limits (on metadata > names, metadata values and global amount of metadata) > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (TIKA-3694) Tika Server endpoint to return more details on a mime type
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3694. --- Resolution: Fixed trivial fixes made. Thank you, again, [~nick]! > Tika Server endpoint to return more details on a mime type > -- > > Key: TIKA-3694 > URL: https://issues.apache.org/jira/browse/TIKA-3694 > Project: Tika > Issue Type: Improvement > Components: mime, server >Affects Versions: 2.3.0 >Reporter: Nick Burch >Assignee: Tim Allison >Priority: Major > Fix For: 2.3.1 > > > As raised on the user list - > [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users > calling the Java APIs are able to get additional details on a mime type, such > as common extensions and descriptions. Those calling the Tika Server can only > get limited information on mime types, such as which are known to Tika > In addition to the current {{/mime-types}} endpoint (html/json/text), we > should add a more detailed one that takes a specific type. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Reopened] (TIKA-3694) Tika Server endpoint to return more details on a mime type
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-3694: --- Assignee: Tim Allison Fixing checkstyle issues > Tika Server endpoint to return more details on a mime type > -- > > Key: TIKA-3694 > URL: https://issues.apache.org/jira/browse/TIKA-3694 > Project: Tika > Issue Type: Improvement > Components: mime, server >Affects Versions: 2.3.0 >Reporter: Nick Burch >Assignee: Tim Allison >Priority: Major > Fix For: 2.3.1 > > > As raised on the user list - > [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users > calling the Java APIs are able to get additional details on a mime type, such > as common extensions and descriptions. Those calling the Tika Server can only > get limited information on mime types, such as which are known to Tika > In addition to the current {{/mime-types}} endpoint (html/json/text), we > should add a more detailed one that takes a specific type. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter
[ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661 ] Tim Allison edited comment on TIKA-3695 at 3/9/22, 4:01 PM: On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app), and they are triggered after the parse of the file. On further thought, I'm not sure this is the best option for two reasons. The metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user...so the data will still be in memory and will consume Tika memory resources until the file's parsing is finished. Also, this solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text, not after the parse has concluded. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. If a parser tries to write too much metadata, do we throw a WriteLimitException and stop parsing, or do we keep parsing but add a "metadata truncation" flag to the metadata object? I'd be inclined to the latter. How do we parameterize the limits? New parameters on AutoDetectParser? New MetadataWriter class (yikes...). Thoughts? was (Author: talli...@mitre.org): On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app). They are triggered after the parse of the file. On further thought, I'm not sure this is the best option for two reason. 
The metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user...so the data will still be in memory and will consume Tika memory resources until the file's parsing is finished. Also, this solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text, not after the parse has concluded. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. If a parser tries to write too much metadata, do we throw a WriteLimitException and stop parsing, or do we keep parsing but add a "metadata truncation" flag to the metadata object? I'd be inclined to the latter. Thoughts? How do we configure it...new parameters on AutoDetectParser? > LimitingMetadataFilter > -- > > Key: TIKA-3695 > URL: https://issues.apache.org/jira/browse/TIKA-3695 > Project: Tika > Issue Type: New Feature > Components: metadata >Affects Versions: 1.28.1, 2.3.0 >Reporter: Julien Massiera >Priority: Major > > Some files may contain abnormally big metadata (several MB, be it for the > metadata values, the metadata names, but also for the total amount of > metadata) that can be problematic concerning the memory consumption. > It would be great to develop a new LimitingMetadataFilter so that we can > filter out the metadata according to different bytes limits (on metadata > names, metadata values and global amount of metadata) > -- This message was sent by Atlassian Jira (v8.20.1#820001)
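As a rough sketch of the "keep parsing but flag truncation" option discussed above, the limits could be enforced at write time, so that oversized data is never stored in the first place. The class below is self-contained and does not use Tika's Metadata class. `BoundedMetadata`, its constructor parameters and `isTruncated` are hypothetical names chosen for illustration, and the policy (drop over-long names, truncate over-long values, drop entries once the total budget is spent) is one assumption among several possible ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Tika's Metadata class and not a proposed API.
final class BoundedMetadata {
    private final int maxNameBytes;
    private final int maxValueBytes;
    private final int maxTotalBytes;
    private final Map<String, List<String>> values = new LinkedHashMap<>();
    private int totalBytes;
    private boolean truncated;

    BoundedMetadata(int maxNameBytes, int maxValueBytes, int maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    // Stores the value if the limits allow it; otherwise truncates or drops it
    // and records that fact instead of throwing.
    void add(String name, String value) {
        int nameBytes = utf8Length(name);
        if (nameBytes > maxNameBytes) {
            truncated = true; // over-long names are dropped entirely
            return;
        }
        String stored = truncateUtf8(value, maxValueBytes);
        if (stored.length() != value.length()) {
            truncated = true;
        }
        // A name is only charged against the budget the first time it is stored.
        int cost = utf8Length(stored) + (values.containsKey(name) ? 0 : nameBytes);
        if (totalBytes + cost > maxTotalBytes) {
            truncated = true; // total budget exhausted
            return;
        }
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(stored);
        totalBytes += cost;
    }

    List<String> get(String name) {
        return Collections.unmodifiableList(
                values.getOrDefault(name, Collections.emptyList()));
    }

    // True if anything was dropped or shortened: the "metadata truncation" flag.
    boolean isTruncated() {
        return truncated;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Longest prefix of s that fits in maxBytes of UTF-8 without splitting a code point.
    private static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }
}
```

Because parsers create new Metadata objects for embedded files, something like this would still need a factory or shared configuration to reach every instance, which is the open question raised in the comment.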
[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter
[ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661 ] Tim Allison edited comment on TIKA-3695 at 3/9/22, 3:47 PM: On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app). They are triggered after the parse of the file. On further thought, I'm not sure this is the best option for two reason. The metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user...so the data will still be in memory and will consume Tika memory resources until the file's parsing is finished. Also, this solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text, not after the parse has concluded. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. If a parser tries to write too much metadata, do we throw a WriteLimitException and stop parsing, or do we keep parsing but add a "metadata truncation" flag to the metadata object? I'd be inclined to the latter. Thoughts? How do we configure it...new parameters on AutoDetectParser? was (Author: talli...@mitre.org): On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app). They are triggered after the parse of the file. On further thought, I'm not sure this is the best option for two reason. 
The metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user...so the data will still be in memory and will consume Tika memory resources until the file's parsing is finished. Also, this solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text, not after the parse has concluded. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. Thoughts? How do we configure it...new parameters on AutoDetectParser? > LimitingMetadataFilter > -- > > Key: TIKA-3695 > URL: https://issues.apache.org/jira/browse/TIKA-3695 > Project: Tika > Issue Type: New Feature > Components: metadata >Affects Versions: 1.28.1, 2.3.0 >Reporter: Julien Massiera >Priority: Major > > Some files may contain abnormally big metadata (several MB, be it for the > metadata values, the metadata names, but also for the total amount of > metadata) that can be problematic concerning the memory consumption. > It would be great to develop a new LimitingMetadataFilter so that we can > filter out the metadata according to different bytes limits (on metadata > names, metadata values and global amount of metadata) > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [tika] tballison merged pull request #526: Bump aws.version from 1.12.169 to 1.12.174
tballison merged pull request #526: URL: https://github.com/apache/tika/pull/526
[GitHub] [tika] tballison merged pull request #525: Bump gson from 2.8.9 to 2.9.0
tballison merged pull request #525: URL: https://github.com/apache/tika/pull/525
[GitHub] [tika] tballison commented on pull request #525: Bump gson from 2.8.9 to 2.9.0
tballison commented on pull request #525: URL: https://github.com/apache/tika/pull/525#issuecomment-1063055561 Ah, TIKA-3694. https://github.com/apache/tika/actions/runs/1948546338
[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503672#comment-17503672 ] Tim Allison commented on TIKA-3694: --- Looks like checkstyle is not happy with these commits. > Tika Server endpoint to return more details on a mime type > -- > > Key: TIKA-3694 > URL: https://issues.apache.org/jira/browse/TIKA-3694 > Project: Tika > Issue Type: Improvement > Components: mime, server >Affects Versions: 2.3.0 >Reporter: Nick Burch >Priority: Major > Fix For: 2.3.1 > > > As raised on the user list - > [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users > calling the Java APIs are able to get additional details on a mime type, such > as common extensions and descriptions. Those calling the Tika Server can only > get limited information on mime types, such as which are known to Tika > In addition to the current {{/mime-types}} endpoint (html/json/text), we > should add a more detailed one that takes a specific type. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (TIKA-3694) Tika Server endpoint to return more details on a mime type
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503672#comment-17503672 ] Tim Allison edited comment on TIKA-3694 at 3/9/22, 3:41 PM: [~nick]. Thank you for doing this! Looks like checkstyle is not happy with these commits. was (Author: talli...@mitre.org): Looks like checkstyle is not happy with these commits. > Tika Server endpoint to return more details on a mime type > -- > > Key: TIKA-3694 > URL: https://issues.apache.org/jira/browse/TIKA-3694 > Project: Tika > Issue Type: Improvement > Components: mime, server >Affects Versions: 2.3.0 >Reporter: Nick Burch >Priority: Major > Fix For: 2.3.1 > > > As raised on the user list - > [https://lists.apache.org/thread/mhtj6cgf323525hs6dow1oz68nkqwfgy] - users > calling the Java APIs are able to get additional details on a mime type, such > as common extensions and descriptions. Those calling the Tika Server can only > get limited information on mime types, such as which are known to Tika > In addition to the current {{/mime-types}} endpoint (html/json/text), we > should add a more detailed one that takes a specific type. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [tika] tballison commented on pull request #525: Bump gson from 2.8.9 to 2.9.0
tballison commented on pull request #525: URL: https://github.com/apache/tika/pull/525#issuecomment-1063052166 I have no idea how checkstyle failed on this one...
[GitHub] [tika] tballison merged pull request #524: Bump jackson.version from 2.13.1 to 2.13.2
tballison merged pull request #524: URL: https://github.com/apache/tika/pull/524
[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter
[ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661 ] Tim Allison edited comment on TIKA-3695 at 3/9/22, 3:36 PM: On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app). They are triggered after the parse of the file. On further thought, I'm not sure this is the best option for two reason. The metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user...so the data will still be in memory and will consume Tika memory resources until the file's parsing is finished. Also, this solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text, not after the parse has concluded. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. Thoughts? How do we configure it...new parameters on AutoDetectParser? was (Author: talli...@mitre.org): On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app). They are triggered after the parse of the file. In this case, the metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user. This solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. Thoughts? 
How do we configure it...new parameters on AutoDetectParser? > LimitingMetadataFilter > -- > > Key: TIKA-3695 > URL: https://issues.apache.org/jira/browse/TIKA-3695 > Project: Tika > Issue Type: New Feature > Components: metadata >Affects Versions: 1.28.1, 2.3.0 >Reporter: Julien Massiera >Priority: Major > > Some files may contain abnormally big metadata (several MB, be it for the > metadata values, the metadata names, but also for the total amount of > metadata) that can be problematic concerning the memory consumption. > It would be great to develop a new LimitingMetadataFilter so that we can > filter out the metadata according to different bytes limits (on metadata > names, metadata values and global amount of metadata) > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3695) LimitingMetadataFilter
[ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661 ] Tim Allison commented on TIKA-3695: --- On the list, I suggested implementing this as a MetadataFilter. These are used by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in tika-app). They are triggered after the parse of the file. In this case, the metadata would be extracted and stored in the Metadata object within Tika, but then truncated/removed before returning the data to the user. This solution will not play well with the traditional xhtml output of /tika where whatever is in the metadata object is written when the parser hits the first bit of content text. I'm wondering if we need to put these protections deeper into the Metadata object itself so that it isn't storing this info and then removing it. Thoughts? How do we configure it...new parameters on AutoDetectParser? > LimitingMetadataFilter > -- > > Key: TIKA-3695 > URL: https://issues.apache.org/jira/browse/TIKA-3695 > Project: Tika > Issue Type: New Feature > Components: metadata >Affects Versions: 1.28.1, 2.3.0 >Reporter: Julien Massiera >Priority: Major > > Some files may contain abnormally big metadata (several MB, be it for the > metadata values, the metadata names, but also for the total amount of > metadata) that can be problematic concerning the memory consumption. > It would be great to develop a new LimitingMetadataFilter so that we can > filter out the metadata according to different bytes limits (on metadata > names, metadata values and global amount of metadata) > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503637#comment-17503637 ] Tim Allison commented on TIKA-3684: --- That configuration should work with 1.24 as well. Is it not working for you? > Extract text returns the text multiple times > > > Key: TIKA-3684 > URL: https://issues.apache.org/jira/browse/TIKA-3684 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.1.0 >Reporter: Naama Hophstatder >Priority: Major > Attachments: example.docx, example.json, tika-config-no-xmf.xml > > > We are using tika docker container as a linux service, when I want to extract > text from a word document, e.g.: > curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain" > we get the text 3 times. > Notice: We also have tika server v1.14, and this version returns the text > just as expected.
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503397#comment-17503397 ] Naama Hophstatder commented on TIKA-3684: - Hi [~tallison], could you help us use the config file you attached with the production tika-server? We run version 1.24 as a Linux service, and I'm not sure of the correct way to do it. Thanks again! > Extract text returns the text multiple times > > > Key: TIKA-3684 > URL: https://issues.apache.org/jira/browse/TIKA-3684 > Project: Tika > Issue Type: Bug > Components: docker >Affects Versions: 2.1.0 >Reporter: Naama Hophstatder >Priority: Major > Attachments: example.docx, example.json, tika-config-no-xmf.xml > > > We are using tika docker container as a linux service, when I want to extract > text from a word document, e.g.: > curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain" > we get the text 3 times. > Notice: We also have tika server v1.14, and this version returns the text > just as expected.