[jira] [Comment Edited] (SOLR-10351) Add analyze Stream Evaluator to support streaming NLP

2017-04-01 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952283#comment-15952283
 ] 

Joel Bernstein edited comment on SOLR-10351 at 4/1/17 3:46 PM:
---

bq. Wouldn't the NLP processing as advertised in the title of this issue be
most likely to put its processing into analysis attributes? This stream
evaluator only emits the character data attribute.

Possibly. I definitely have much to learn about the analysis chain. In the
first pass I was mostly interested in getting the token stream from the
analysis chain. What I had envisioned for the future was having analysis chains
that perform sentence chunking, entity extraction, noun phrase extraction,
etc. I was seeing these as finished token streams. But exposing the
analysis attributes would seem to make sense in the future.

bq. BTW Please use try-finally (even try-with-resources style) to close 
token-streams wherever possible. Analyzer internal parts are internally shared 
in thread-locals and the ramifications can be nasty on the entire Solr node if 
at any time one filter has a bug or something on a particular value. Your Solr 
node then becomes poisoned in a sense and only a restart will fix the ailment.

Will do.
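
A minimal sketch of the close-on-every-path pattern requested above. It uses a hypothetical stand-in class, `FakeTokenStream`, rather than Lucene's real `TokenStream`, so that it runs with the JDK alone; the class name, the `term()` accessor, and the simulated "boom" failure are assumptions made for illustration. With Lucene itself, the corresponding calls would be `analyzer.tokenStream(...)`, `reset()`, `incrementToken()`, `end()`, and `close()`.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenStreamCloseSketch {

    /** Hypothetical stand-in for an analysis token stream; NOT Lucene's TokenStream. */
    static final class FakeTokenStream implements AutoCloseable {
        private final String[] tokens;
        private int pos = -1;
        private boolean closed = false;

        FakeTokenStream(String text) {
            String trimmed = text.trim();
            this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Advances to the next token; throws on the token "boom" to simulate a buggy filter. */
        boolean incrementToken() {
            if (pos + 1 < tokens.length && "boom".equals(tokens[pos + 1])) {
                throw new IllegalStateException("simulated filter bug");
            }
            return ++pos < tokens.length;
        }

        String term() {
            return tokens[pos];
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Collects all tokens; try-with-resources closes the stream even if a filter throws. */
    static List<String> collectTokens(FakeTokenStream stream) {
        List<String> out = new ArrayList<>();
        try (FakeTokenStream ts = stream) {
            while (ts.incrementToken()) {
                out.add(ts.term());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FakeTokenStream ok = new FakeTokenStream("solr streaming nlp");
        System.out.println(collectTokens(ok) + " closed=" + ok.isClosed());

        FakeTokenStream bad = new FakeTokenStream("solr boom nlp");
        try {
            collectTokens(bad);
        } catch (IllegalStateException e) {
            // The stream was still closed, so no shared state is left half-consumed.
            System.out.println("failed: " + e.getMessage() + " closed=" + bad.isClosed());
        }
    }
}
```

The point of the sketch is the second case: the exception propagates to the caller, but `close()` has already run by the time it is caught.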


was (Author: joel.bernstein):
bq. Wouldn't the NLP processing as advertised in the title of this issue be
most likely to put its processing into analysis attributes? This stream
evaluator only emits the character data attribute.

Possibly. I definitely have much to learn about the analysis chain. In the
first pass I was mostly interested in getting the token stream from the
analysis chain. What I had envisioned for the future was having token streams
that perform sentence chunking, entity extraction, noun phrase extraction,
etc. I was seeing these as finished token streams. But exposing the
analysis attributes would seem to make sense in the future.

bq. BTW Please use try-finally (even try-with-resources style) to close 
token-streams wherever possible. Analyzer internal parts are internally shared 
in thread-locals and the ramifications can be nasty on the entire Solr node if 
at any time one filter has a bug or something on a particular value. Your Solr 
node then becomes poisoned in a sense and only a restart will fix the ailment.

Will do.

> Add analyze Stream Evaluator to support streaming NLP
> -
>
> Key: SOLR-10351
> URL: https://issues.apache.org/jira/browse/SOLR-10351
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Labels: NLP, Streaming
> Fix For: 6.6
>
> Attachments: SOLR-10351.patch, SOLR-10351.patch, SOLR-10351.patch, 
> SOLR-10351.patch
>
>
> The *analyze* Stream Evaluator uses a Solr analyzer to return a collection of 
> tokens from a *text field*. The collection of tokens can then be streamed out 
> by the *cartesianProduct* Streaming Expression or attached to documents as
> multi-valued fields by the *select* Streaming Expression.
> This allows Streaming Expressions to leverage all the existing tokenizers and 
> filters and provides a place for future NLP analyzers to be added to 
> Streaming Expressions.
> Sample syntax:
> {code}
> cartesianProduct(expr, analyze(analyzerField, textField) as outfield )
> {code}
> {code}
> select(expr, analyze(analyzerField, textField) as outfield )
> {code}
> Combined with Solr's batch text processing capabilities, this provides an
> entire parallel NLP framework. Solr's batch processing capabilities are
> described here:
> *Batch jobs, Parallel ETL and Streaming Text Transformation*
> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
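
The difference between the two sample expressions in the description can be illustrated with a small plain-Java sketch. This is not Solr code: `analyze` here is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, tuples are modeled as ordinary maps, and the method names `selectLike` and `cartesianProductLike` are invented for illustration. It assumes, as the description states, that *select* attaches the tokens as one multi-valued field while *cartesianProduct* streams them out, taken here to mean one tuple per token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AnalyzeSemanticsSketch {

    /** Hypothetical stand-in for analyze(): lowercase, then split on non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** select-like behavior: one output tuple, tokens attached as a multi-valued field. */
    static Map<String, Object> selectLike(Map<String, Object> tuple, String textField, String outField) {
        Map<String, Object> out = new LinkedHashMap<>(tuple);
        out.put(outField, analyze((String) tuple.get(textField)));
        return out;
    }

    /** cartesianProduct-like behavior: one output tuple per token. */
    static List<Map<String, Object>> cartesianProductLike(
            Map<String, Object> tuple, String textField, String outField) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String token : analyze((String) tuple.get(textField))) {
            Map<String, Object> copy = new LinkedHashMap<>(tuple);
            copy.put(outField, token);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("body", "Streaming NLP, in Solr");

        // One tuple carrying every token in a single multi-valued field.
        System.out.println(selectLike(doc, "body", "outfield"));

        // One tuple per token, each repeating the other fields of the source tuple.
        for (Map<String, Object> t : cartesianProductLike(doc, "body", "outfield")) {
            System.out.println(t);
        }
    }
}
```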



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-10351) Add analyze Stream Evaluator to support streaming NLP

2017-03-30 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949342#comment-15949342
 ] 

Joel Bernstein edited comment on SOLR-10351 at 3/30/17 4:26 PM:


Added a test with the select function


was (Author: joel.bernstein):
Added a test with select function



