[
https://issues.apache.org/jira/browse/TIKA-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013472#comment-16013472
]
Thamme Gowda commented on TIKA-2360:
------------------------------------
Sorry, I am late to the discussion.
1. +1 to turning it OFF. I had no intention of turning it on by default (the
code existed in the previously abandoned PR, and I did not think it could
become a problem).
2. About downloading models: we can configure the Maven build to download the
models on the first run. The downloaded files would reside in
src/test/resources. I see no need to include models in src/main/resources,
since this feature is off by default and the models increase the final jar
size. Whoever turns on the sentiment-analysis feature should then configure it
manually (Step 1: wget the model; Step 2: set its path in the XML file).
3. Regarding parser or not:
Agree with [[email protected]].
We have two kinds of recognisers, so we need two parsers.
First: NER, SentimentAnalysis, AgePredictor or any other text/NLP classifier.
INPUT: text; OUTPUT: set of metadata key-value pairs.
Second: ObjectRecogniser, VideoLabeller, OCR, Caption. INPUT: raw bytes;
OUTPUT: set of metadata key-value pairs.
My suggestion: let us create two generic parsers. The first one extracts text
and the other one does not. All the machine learning (ML) actions can be
seen as add-ons to these two parsers, and configuration can enable or disable
each add-on.
The ML features whose input we can hold in memory (such as extracted text) can
be add-ons to the generic parser. That way we can call many add-ons in line
per read-parse-extract call and merge all the metadata.
The ML features whose input we cannot hold in memory (such as a large video)
can be independent parsers; each one streams the raw content directly on its
own.
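A rough Java sketch of the split described above, only to make the shape
concrete. Every name in it (TextAddOn, StreamRecogniser, GenericTextParser,
ByteCountRecogniser, AddOnSketch) is hypothetical and not an existing Tika
class, and the "text extraction" is a stand-in that just decodes the stream as
UTF-8. It assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; none of these are existing Tika classes.

/** Add-on that works on text already extracted and held in memory. */
interface TextAddOn {
    Map<String, String> analyze(String text);
}

/** Recogniser that consumes raw bytes as a stream, without buffering it all. */
interface StreamRecogniser {
    Map<String, String> recognise(InputStream raw) throws IOException;
}

/** Generic text parser: one read-parse-extract call, then each enabled add-on. */
final class GenericTextParser {
    private final List<TextAddOn> enabledAddOns;

    GenericTextParser(List<TextAddOn> enabledAddOns) {
        this.enabledAddOns = enabledAddOns;
    }

    Map<String, String> parse(InputStream in) throws IOException {
        // Stand-in for real text extraction: decode the whole stream as UTF-8.
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        Map<String, String> metadata = new LinkedHashMap<>();
        for (TextAddOn addOn : enabledAddOns) {
            // Merge the key-value pairs produced by each add-on.
            metadata.putAll(addOn.analyze(text));
        }
        return metadata;
    }
}

/** Toy independent parser: reads the stream chunk by chunk and counts bytes. */
final class ByteCountRecogniser implements StreamRecogniser {
    @Override
    public Map<String, String> recognise(InputStream raw) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = raw.read(buffer)) != -1) {
            total += n;
        }
        return Map.of("byteCount", String.valueOf(total));
    }
}

final class AddOnSketch {
    public static void main(String[] args) throws IOException {
        // Two toy add-ons standing in for NER, sentiment analysis, etc.
        TextAddOn length = text -> Map.of("length", String.valueOf(text.length()));
        TextAddOn exclaim =
                text -> Map.of("hasExclamation", String.valueOf(text.contains("!")));

        GenericTextParser parser = new GenericTextParser(List.of(length, exclaim));
        byte[] content = "Great job!".getBytes(StandardCharsets.UTF_8);

        System.out.println(parser.parse(new ByteArrayInputStream(content)));
        System.out.println(
                new ByteCountRecogniser().recognise(new ByteArrayInputStream(content)));
    }
}
```

In this sketch, enabling or disabling an add-on is just a matter of which
entries the configuration puts in the list passed to the generic parser, while
the streaming recognisers stay separate and never see buffered text.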
WDYT about this approach?
> Handle SentimentParser resource failure more robustly
> -----------------------------------------------------
>
> Key: TIKA-2360
> URL: https://issues.apache.org/jira/browse/TIKA-2360
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Blocker
> Fix For: 1.15
>
>
> The SentimentParser tests currently require a network call to GitHub. For
> those working behind a proxy or who would prefer Tika not to make unexpected
> network calls, can we please turn this off by default?