[ 
https://issues.apache.org/jira/browse/TIKA-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013472#comment-16013472
 ] 

Thamme Gowda commented on TIKA-2360:
------------------------------------

Sorry, I am late to the discussion.
1. (y) to turning it OFF. I had no intention of turning it on by default (the 
code already existed in the previously abandoned PR, and I did not think it 
would become a problem). 

2. About downloading models: we can configure the Maven build to download the 
models on the first run. The downloaded files would reside in 
src/test/resources. I see no need to include the models in src/main/resources, 
since this feature is off by default and the models increase the final jar 
size. Whoever turns on the sentiment-analysis feature would then configure it 
manually (Step 1: wget the model; Step 2: set its path in the XML file).

3. Regarding parser or not: 
Agree with [[email protected]]. 
We have two kinds of recognisers, so we need two parsers.
First: NER, SentimentAnalysis, AgePredictor, or any other text/NLP classifier. 
INPUT: text; OUTPUT: a set of metadata key-value pairs.
Second: ObjectRecogniser, VideoLabeller, OCR, Caption. 
INPUT: raw bytes; OUTPUT: a set of metadata key-value pairs.
My suggestion: let us create two generic parsers. The first one extracts text 
and the other one does not. All the machine learning (ML) actions can be seen 
as add-ons to these two parsers, and configuration can enable or disable each 
add-on.
ML features whose input we can hold in memory (such as extracted text) can be 
add-ons to the generic parser. This way we can call many add-ons in sequence 
per read-parse-extract call and merge all the metadata.
ML features whose input we cannot hold in memory (such as a large video) can 
be independent parsers that stream the raw content directly themselves.
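To make the add-on idea concrete, here is a rough sketch. It is not existing 
Tika code: TextAddOn, GenericTextParser, AddOnDemo, and the metadata keys are 
all hypothetical names, and the two add-ons are toy stand-ins for real 
classifiers. It only shows one extract call feeding several add-ons, with 
their metadata merged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical add-on contract: text in, metadata key-value pairs out.
interface TextAddOn {
    Map<String, String> analyze(String text);
}

// Hypothetical generic parser: the text is extracted once (passed in here),
// then every enabled add-on runs on it and the results are merged.
final class GenericTextParser {
    private final List<TextAddOn> addOns = new ArrayList<>();

    // Stands in for enabling an add-on through configuration.
    GenericTextParser enable(TextAddOn addOn) {
        addOns.add(addOn);
        return this;
    }

    Map<String, String> parse(String extractedText) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (TextAddOn addOn : addOns) {
            // Later add-ons overwrite earlier ones if they reuse a key.
            metadata.putAll(addOn.analyze(extractedText));
        }
        return metadata;
    }
}

class AddOnDemo {
    static Map<String, String> run(String text) {
        // Toy stand-in for a sentiment classifier.
        TextAddOn sentiment = t -> Collections.singletonMap(
                "sentiment", t.contains("good") ? "positive" : "unknown");
        // Toy stand-in for another text-based add-on.
        TextAddOn tokenCount = t -> Collections.singletonMap(
                "tokenCount", String.valueOf(t.trim().split("\\s+").length));

        return new GenericTextParser()
                .enable(sentiment)
                .enable(tokenCount)
                .parse(text);
    }

    public static void main(String[] args) {
        System.out.println(run("Tika is good"));
    }
}
```

An add-on that is not enabled is never called, so anything that needs a 
downloaded model stays out of the way unless someone turns it on.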
WDYT about this approach?

> Handle SentimentParser resource failure more robustly
> -----------------------------------------------------
>
>                 Key: TIKA-2360
>                 URL: https://issues.apache.org/jira/browse/TIKA-2360
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Blocker
>             Fix For: 1.15
>
>
> The SentimentParser tests currently require a network call to GitHub.  For 
> those working behind a proxy, or who would prefer Tika not to make unexpected 
> network calls, can we please turn this off by default?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
