[
https://issues.apache.org/jira/browse/SOLR-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952283#comment-15952283
]
Joel Bernstein edited comment on SOLR-10351 at 4/1/17 3:46 PM:
---
bq. Wouldn't the NLP processing as advertised in the title of this issue be
most likely to put its processing into analysis attributes? This stream
evaluator only emits the character data attribute.
Possibly. I definitely have much to learn about the analysis chain. In the
first pass I was mostly interested in getting the token stream from the
analysis chain. What I had envisioned in the future was having analysis chains
that perform sentence chunking, entity extraction, noun phrase extraction,
etc. I was seeing these as finished token streams. But exposing the
analysis attributes would seem to make sense in the future.
bq. BTW Please use try-finally (even try-with-resources style) to close
token-streams wherever possible. Analyzer internal parts are internally shared
in thread-locals and the ramifications can be nasty on the entire Solr node if
at any time one filter has a bug or something on a particular value. Your Solr
node then becomes poisoned in a sense and only a restart will fix the ailment.
Will do.
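For reference, a minimal sketch of the close-on-every-path pattern requested above. It is self-contained and does not use Lucene: FakeTokenStream, its incrementToken()/term() methods, and the "BOOM" failure trigger are hypothetical stand-ins invented for illustration. Lucene's real TokenStream is Closeable, so the same try-with-resources shape should apply to it, with the additional reset() and end() calls its consumer workflow requires.

```java
import java.util.ArrayList;
import java.util.List;

class TokenStreamCloseSketch {

    // Hypothetical stand-in for a token stream; this is NOT Lucene's TokenStream.
    static final class FakeTokenStream implements AutoCloseable {
        // Counts streams that are open and not yet closed (demonstration only).
        static int openCount = 0;

        private final String[] terms;
        private int pos = -1;

        FakeTokenStream(String text) {
            this.terms = text.trim().split("\\s+");
            openCount++;
        }

        // Advances to the next token; simulates a buggy filter on the value "BOOM".
        boolean incrementToken() {
            pos++;
            if (pos < terms.length && terms[pos].equals("BOOM")) {
                throw new IllegalStateException("filter bug on value: BOOM");
            }
            return pos < terms.length;
        }

        String term() {
            return terms[pos];
        }

        @Override
        public void close() {
            openCount--;
        }
    }

    // Collects all tokens; try-with-resources closes the stream even if a filter throws.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        try (FakeTokenStream ts = new FakeTokenStream(text)) {
            while (ts.incrementToken()) {
                tokens.add(ts.term());
            }
        }
        return tokens;
    }
}
```

The point of the sketch is that openCount returns to zero whether analyze() returns normally or the simulated filter throws, which is the property that keeps shared analyzer state from being left half-consumed.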
> Add analyze Stream Evaluator to support streaming NLP
> -
>
> Key: SOLR-10351
> URL: https://issues.apache.org/jira/browse/SOLR-10351
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Labels: NLP, Streaming
> Fix For: 6.6
>
> Attachments: SOLR-10351.patch, SOLR-10351.patch, SOLR-10351.patch,
> SOLR-10351.patch
>
>
> The *analyze* Stream Evaluator uses a Solr analyzer to return a collection of
> tokens from a *text field*. The collection of tokens can then be streamed out
> by the *cartesianProduct* Streaming Expression or attached to documents as
> multi-valued fields by the *select* Streaming Expression.
> This allows Streaming Expressions to leverage all the existing tokenizers and
> filters and provides a place for future NLP analyzers to be added to
> Streaming Expressions.
> Sample syntax:
> {code}
> cartesianProduct(expr, analyze(analyzerField, textField) as outfield )
> {code}
> {code}
> select(expr, analyze(analyzerField, textField) as outfield )
> {code}
> Combined with Solr's batch text processing capabilities this provides an
> entire parallel NLP framework. Solr's batch processing capabilities are
> described here:
> *Batch jobs, Parallel ETL and Streaming Text Transformation*
> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
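To make the two output shapes in the description concrete, here is a small self-contained Java sketch. It does not call Solr: tuples are modeled as plain maps, and analyze(), selectLike() and cartesianProductLike() are hypothetical helpers invented for illustration, with a whitespace-and-lowercase split standing in for a real Solr analyzer. It only models the behavior described above, where select attaches the tokens as one multi-valued field and cartesianProduct streams out one tuple per token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class AnalyzeShapeSketch {

    // Hypothetical stand-in for the analyze evaluator: lowercase + whitespace split.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // select-like shape: one output tuple, tokens attached as a multi-valued field.
    static Map<String, Object> selectLike(Map<String, Object> tuple,
                                          String textField, String outField) {
        Map<String, Object> out = new LinkedHashMap<>(tuple);
        out.put(outField, analyze((String) tuple.get(textField)));
        return out;
    }

    // cartesianProduct-like shape: one output tuple per token.
    static List<Map<String, Object>> cartesianProductLike(Map<String, Object> tuple,
                                                          String textField, String outField) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String token : analyze((String) tuple.get(textField))) {
            Map<String, Object> copy = new LinkedHashMap<>(tuple);
            copy.put(outField, token);
            out.add(copy);
        }
        return out;
    }
}
```

With a tuple whose text field holds two words, selectLike yields a single tuple carrying a two-element list, while cartesianProductLike yields two tuples that each carry one token alongside the original fields.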