[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-10 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504150#comment-17504150
 ] 

Nick Burch commented on TIKA-3684:
--

Same as Tika 2.x - pass a {{--config}} flag when you start the server
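
For a locally run 1.24 server, the startup command might look like the sketch below. The jar path, config path, and port are assumptions for illustration and do not come from this issue. The script only prints the command so it can be reviewed before it is wired into a service definition.

```shell
#!/bin/sh
# Assumed locations; adjust to your installation (none of these come from the issue).
TIKA_JAR="${TIKA_JAR:-/opt/tika/tika-server-1.24.jar}"
TIKA_CONFIG="${TIKA_CONFIG:-/opt/tika/tika-config-no-xmf.xml}"
TIKA_PORT="${TIKA_PORT:-9998}"

# Build the startup command: the custom config is passed with --config.
tika_start_cmd() {
  printf 'java -jar %s --config %s --port %s\n' "$TIKA_JAR" "$TIKA_CONFIG" "$TIKA_PORT"
}

# Print the command; a service manager would run this line
# (for example as the ExecStart value of a systemd unit).
tika_start_cmd
```

If the server is managed by systemd, the printed line would typically go into the unit's ExecStart entry, followed by a restart of the service.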

> Extract text returns the text multiple times
> 
>
> Key: TIKA-3684
> URL: https://issues.apache.org/jira/browse/TIKA-3684
> Project: Tika
> Issue Type: Bug
> Components: docker
> Affects Versions: 2.1.0
> Reporter: Naama Hophstatder
> Priority: Major
> Attachments: example.docx, example.json, tika-config-no-xmf.xml
>
>
> We are using the tika docker container as a linux service. When I want to 
> extract text from a word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Note: We also have tika server v1.14, and this version returns the text 
> just as expected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-09 Thread Naama Hophstatder (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504036#comment-17504036
 ] 

Naama Hophstatder commented on TIKA-3684:
-

I don't know how I should configure the service, as I'm running it locally, not 
in a docker container.

The docs only cover 2.0, so can you help us configure a local 1.24 
tika-server as a linux service?



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503637#comment-17503637
 ] 

Tim Allison commented on TIKA-3684:
---

That configuration should work with 1.24 as well.  Is it not working for you?



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-09 Thread Naama Hophstatder (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503397#comment-17503397
 ] 

Naama Hophstatder commented on TIKA-3684:
-

Hi [~tallison], could you help us use the config file you attached with the 
production tika-server?

We use version 1.24 as a Linux service, and I'm not sure of the correct way to 
do it.

Thanks again!



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500720#comment-17500720
 ] 

Tim Allison commented on TIKA-3684:
---

Oops. Thank you!



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-02 Thread Naama Hophstatder (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500554#comment-17500554
 ] 

Naama Hophstatder commented on TIKA-3684:
-

Thanks for your efforts. I took your xml config file and it works.

Just to mention, the correct way for me to configure tika-server in docker 
(version 2.1.0) is with this command (taken from the [tika docker 
repo|https://github.com/apache/tika-docker]):

docker run -d -p :9998 -v `pwd`/tika-config.xml:/tika-config.xml apache/tika:2.1.0 --config /tika-config.xml



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500173#comment-17500173
 ] 

Tim Allison commented on TIKA-3684:
---

I attached an example for turning off the WMFParser and the EMFParser. When 
calling tika-server in docker, add {{-c tika-config-no-xmf.xml}}
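
For readers without access to the attachment, a config of this kind might look like the sketch below. It is not the attached file: the parser class names are assumed from Tika's Microsoft parser package, and the output file name simply reuses the attachment's name. The script writes the XML with a heredoc so the whole example stays in shell.

```shell
#!/bin/sh
# Write a Tika config that keeps the default parsers but excludes the
# WMF and EMF parsers. The class names are an assumption; the file
# attached to the issue may differ.
CONFIG_FILE="${CONFIG_FILE:-tika-config-no-xmf.xml}"

cat > "$CONFIG_FILE" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.WMFParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.EMFParser"/>
    </parser>
  </parsers>
</properties>
EOF

echo "wrote $CONFIG_FILE"
```

With both parsers excluded, embedded emf/wmf files are no longer parsed for text, so any attachments inside them are skipped as well.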



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500168#comment-17500168
 ] 

Tim Allison commented on TIKA-3684:
---

We could also parameterize the WMF and EMF parsers to turn off text extraction. 
It _feels_ like wmf+emf used to be used for new information, images, etc. 
within a page, but more recently, I'm seeing it used as a thumbnail.

Another option for improvement would be to allow configuration of the embedded 
parser to skip thumbnails.  This emf is correctly identified as a thumbnail in 
its metadata in the /rmeta output.  I cannot guarantee that all emf/wmf will be 
identified as such, though.



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500170#comment-17500170
 ] 

Tim Allison commented on TIKA-3684:
---

Sorry, didn't see your response.  

bq. has no "text" meaning in this situation

In this situation (especially explicitly marked as a thumbnail), I agree.  
However, there are others where there is new/novel text in the wmf+emf.

Example coming shortly.



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-02 Thread Naama Hophstatder (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500143#comment-17500143
 ] 

Naama Hophstatder commented on TIKA-3684:
-

I see the results of the /rmeta endpoint, understand where the issue comes 
from, but as far as I understand the emf/wmf attachments has no "text" meaning 
in this situation, so I want to disable the related parsers.

Can you give me an example of how I can turn them off? And does it change when 
running in a docker container?

Thanks in advance.



[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500066#comment-17500066
 ] 

Tim Allison commented on TIKA-3684:
---

If you use the /rmeta endpoint (attached), you can see that there's a 
thumbnail.emf, which also contains the text, and that file contains another 
attachment, a .wmf file, that also contains the text.

We didn't have parsers for emf/wmf back in 1.14.  You can turn off those 
parsers via tika-config.xml (I'm happy to give an example if needed).  The risk 
is that emf files can contain attachments...so you may miss information.
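
A rough way to check this on your own documents is sketched below. The server URL and file name follow the report. The helper names are hypothetical, and the count relies on the assumption that /rmeta marks embedded documents with the "X-TIKA:embedded_resource_path" metadata key.

```shell
#!/bin/sh
# Server URL as used in the report; adjust as needed.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Upload a document to /rmeta and print the JSON list of metadata objects
# (one for the container document, one per embedded document).
fetch_rmeta() {
  # $1: path to the document to upload
  curl -s -T "$1" "$TIKA_URL/rmeta" --header "Accept: application/json"
}

# Read /rmeta JSON on stdin and count entries that carry an embedded
# resource path, i.e. embedded documents such as thumbnail.emf.
count_embedded() {
  grep -o '"X-TIKA:embedded_resource_path"' | wc -l | tr -d ' '
}

# Usage (requires a running server):
#   fetch_rmeta example.docx | count_embedded
```

A count above zero means the /tika text output may include text from embedded files in addition to the main document body.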
