[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504150#comment-17504150 ]

Nick Burch commented on TIKA-3684:

Same as Tika 2.x - pass a {{--config}} flag when you start the server

> Extract text returns the text multiple times
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
> We are using tika docker container as a linux service, when I want to extract
> text from a word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Notice: We also have tika server v1.14, and this version returns the text
> just as expected.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
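For a locally installed 1.24 tika-server, the advice above might be sketched roughly as follows. The jar and config paths are hypothetical placeholders, not taken from the thread, and how the command is wired into a Linux service depends on the service manager in use.

```shell
# Hypothetical install locations (assumptions); adjust to where the jar and config live.
TIKA_JAR="${TIKA_JAR:-/opt/tika/tika-server-1.24.jar}"
TIKA_CONFIG="${TIKA_CONFIG:-/opt/tika/tika-config-no-xmf.xml}"

# Print the command line that a service wrapper would run.
tika_server_cmd() {
  printf 'java -jar %s --config %s\n' "$TIKA_JAR" "$TIKA_CONFIG"
}

# Start the server in the foreground with the custom config.
# Not called here; a service script would invoke it.
start_tika_server() {
  exec java -jar "$TIKA_JAR" --config "$TIKA_CONFIG"
}
```

A service definition would then call `start_tika_server`, or carry the command printed by `tika_server_cmd` directly.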
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504036#comment-17504036 ]

Naama Hophstatder commented on TIKA-3684:

I don't know how I should configure the service, as I'm running it locally, not in a Docker container. The docs only cover 2.0, so can you help us configure a local 1.24 tika-server as a Linux service?
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503637#comment-17503637 ]

Tim Allison commented on TIKA-3684:

That configuration should work with 1.24 as well. Is it not working for you?
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503397#comment-17503397 ]

Naama Hophstatder commented on TIKA-3684:

Hi [~tallison], could you help us use the attached config file with our production tika-server? We use version 1.24 as a Linux service, and I'm not sure of the correct way to do it. Thanks again!
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500720#comment-17500720 ]

Tim Allison commented on TIKA-3684:

Oops. Thank you!
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500554#comment-17500554 ]

Naama Hophstatder commented on TIKA-3684:

Thanks for your efforts. I took your XML config file and it works.
Just to mention, the correct way for me to configure tika-server in Docker (version 2.1.0) is with this command:

docker run -d -p :9998 -v `pwd`/tika-config.xml:/tika-config.xml apache/tika:2.1.0 --config /tika-config.xml

(taken from the [tika docker repo|https://github.com/apache/tika-docker])
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500173#comment-17500173 ]

Tim Allison commented on TIKA-3684:

I attached an example for turning off the WMFParser and the EMFParser. When calling tika-server in docker, add {{-c tika-config-no-xmf.xml}}
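The attached tika-config-no-xmf.xml is not reproduced in this thread. As a sketch of what such a config might look like, the helper below writes a tika-config file that excludes the two parsers from the default parser. The function name is hypothetical, and the fully qualified class names are an assumption based on Tika's Microsoft parser module; check them against the Tika version in use.

```shell
# Write a tika-config file that excludes the EMF and WMF parsers.
# $1: path of the config file to create.
# Class names are assumed, not copied from the attachment.
write_no_xmf_config() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.EMFParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.WMFParser"/>
    </parser>
  </parsers>
</properties>
EOF
}
```

For example, `write_no_xmf_config tika-config-no-xmf.xml` creates the file in the current directory, which can then be passed to tika-server with the flag mentioned above.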
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500168#comment-17500168 ]

Tim Allison commented on TIKA-3684:

We could also parameterize the WMF and EMF parsers to turn off text extraction. It _feels_ like wmf+emf used to be used for new information, images, etc. within a page, but more recently, I'm seeing it being used as a thumbnail.

Another option for improvement would be to allow configuration of the embedded parser to skip thumbnails. This emf is correctly identified as a thumbnail in its metadata in the /rmeta output. I cannot guarantee that all emf/wmf will be identified as such, though.
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500170#comment-17500170 ]

Tim Allison commented on TIKA-3684:

Sorry, didn't see your response.

bq. has no "text" meaning in this situation

In this situation (especially explicitly marked as a thumbnail), I agree. However, there are others where there is new/novel text in the wmf+emf. Example coming shortly.
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500143#comment-17500143 ]

Naama Hophstatder commented on TIKA-3684:

I see the results of the /rmeta endpoint and understand where the issue comes from. As far as I understand, the emf/wmf attachments have no "text" meaning in this situation, so I want to disable the related parsers.
Can you give me an example of how I can turn them off? And does it change when working in Docker container mode?
Thanks in advance.
[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500066#comment-17500066 ]

Tim Allison commented on TIKA-3684:

If you use the /rmeta endpoint (output attached), you can see that there's a thumbnail.emf, which also contains the text, and that file contains another attachment, a .wmf file, that also contains the text. We didn't have parsers for emf/wmf back in 1.14.

You can turn off those parsers via tika-config.xml (I'm happy to give an example if needed). The risk is that emf files can contain attachments, so you may miss information.
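To see the duplication described above on your own document, something like the following could be used. It is only a sketch: the function names are hypothetical, the server address is the one from the issue description, and it assumes the /rmeta JSON stores each document's extracted text under the {{X-TIKA:content}} key.

```shell
# Fetch the recursive metadata view of a document from a running tika-server.
# $1: document to upload. Assumes the server listens on localhost:9998.
fetch_rmeta() {
  curl -s -T "$1" http://localhost:9998/rmeta --header "Accept: application/json"
}

# Count how many entries in a saved /rmeta JSON file carry extracted text.
# $1: path to the JSON file.
count_text_entries() {
  grep -o '"X-TIKA:content"' "$1" | wc -l | tr -d ' '
}
```

For example, `fetch_rmeta example.docx > example.json` followed by `count_text_entries example.json` gives the number of entries with text: the container document plus any embedded files, such as the thumbnail.emf and its .wmf, that a parser extracted text from.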