[jira] [Updated] (SOLR-9240) Support running the topic() Streaming Expression in parallel mode.

Joel Bernstein (JIRA) Fri, 24 Jun 2016 19:24:57 -0700

     [ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-9240:
---------------------------------
    Description: 
Currently the topic() function won't run in parallel mode because each worker 
needs to maintain a separate set of checkpoints. The proposed solution for this 
is to append the worker ID to the topic ID, which will cause each worker to 
have its own checkpoints.
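
The effect of appending the worker ID can be illustrated with a minimal sketch. Everything named here is hypothetical: the `worker_topic_id` helper, the `_` separator, and the in-memory `CheckpointStore` are stand-ins, not Solr's implementation. The sketch only shows why a suffixed topic ID keeps each worker's checkpoints separate.

```python
# Hypothetical illustration only; none of these names come from Solr.
from typing import Dict


def worker_topic_id(topic_id: str, worker_id: int) -> str:
    """Derive a per-worker topic ID by appending the worker ID.

    The "_" separator is an assumption made for this sketch.
    """
    return f"{topic_id}_{worker_id}"


class CheckpointStore:
    """In-memory stand-in for wherever checkpoints are persisted."""

    def __init__(self) -> None:
        # Maps topic ID -> {shard name -> last version seen}.
        self._checkpoints: Dict[str, Dict[str, int]] = {}

    def save(self, topic_id: str, shard: str, version: int) -> None:
        self._checkpoints.setdefault(topic_id, {})[shard] = version

    def load(self, topic_id: str) -> Dict[str, int]:
        return dict(self._checkpoints.get(topic_id, {}))


store = CheckpointStore()
# Two workers run the same logical topic but record different progress.
store.save(worker_topic_id("topic1", 0), "shard1", 100)
store.save(worker_topic_id("topic1", 1), "shard1", 250)
```

Because the two workers write under different keys, neither overwrites the other's progress for `shard1`.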

It would be useful to support parallelizing the topic() function because it 
will provide a general-purpose approach for processing text in parallel across 
worker nodes.

For example, this would allow a classify() function to be wrapped around a 
topic() function to classify documents in parallel across worker nodes. 

Sample syntax:

{code}
parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
{code}

The example above would send a daemon to worker nodes that would classify all 
documents returned by the topic() function. The update function would send the 
output of classify() to a SolrCloud collection for indexing.

The partitionKeys parameter would ensure that each worker would receive a 
partition of the results returned by the topic() function. This allows the 
classify() function to be run in parallel.
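
As a rough sketch of what partitioning by key means, the snippet below assigns each document to exactly one worker by hashing the partition key. The `worker_for` and `partition` helpers and the use of CRC32 are assumptions chosen for illustration; Solr's actual hashing is not shown in this issue and may differ.

```python
# Hypothetical illustration of hash partitioning; not Solr's implementation.
import zlib
from typing import Dict, List


def worker_for(key: str, num_workers: int) -> int:
    """Map a partition key to a worker index with a deterministic hash.

    CRC32 is a stand-in hash picked for this sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(docs: List[dict], key_field: str, num_workers: int) -> Dict[int, List[dict]]:
    """Split docs so that each one lands on exactly one worker."""
    parts: Dict[int, List[dict]] = {w: [] for w in range(num_workers)}
    for doc in docs:
        parts[worker_for(str(doc[key_field]), num_workers)].append(doc)
    return parts


docs = [{"id": f"doc{i}"} for i in range(10)]
parts = partition(docs, "id", 3)
```

Since the hash is deterministic, the same document always maps to the same worker, so the per-worker result sets are disjoint and together cover every document.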

  was:
Currently the topic() function doesn't accept a partitionKeys parameter like 
the search() function does. This means the topic() function can't be wrapped by 
the parallel() function to run across worker nodes.

It would be useful to support parallelizing the topic function because it would 
provide a general purpose parallelized approach for processing batches of data 
as they enter the index.

For example, this would allow a classify() function to be wrapped around a 
topic() function to classify documents in parallel across worker nodes. 

Sample syntax:

{code}
parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
{code}

The example above would send a daemon to worker nodes that would classify all 
new documents returned by the topic() function. The update function would send 
the output of classify() to a SolrCloud collection for indexing.

The partitionKeys parameter would ensure that each worker would receive a 
partition of the results returned by the topic() function. This allows the 
classify() function to be run in parallel.

> Support running the topic() Streaming Expression in parallel mode.
> ------------------------------------------------------------------
>
>                 Key: SOLR-9240
>                 URL: https://issues.apache.org/jira/browse/SOLR-9240
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>
> Currently the topic() function won't run in parallel mode because each worker 
> needs to maintain a separate set of checkpoints. The proposed solution for 
> this is to append the worker ID to the topic ID, which will cause each worker 
> to have its own checkpoints.
> It would be useful to support parallelizing the topic() function because it 
> will provide a general-purpose approach for processing text in parallel 
> across worker nodes.
> For example, this would allow a classify() function to be wrapped around a 
> topic() function to classify documents in parallel across worker nodes. 
> Sample syntax:
> {code}
> parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
> {code}
> The example above would send a daemon to worker nodes that would classify all 
> documents returned by the topic() function. The update function would send 
> the output of classify() to a SolrCloud collection for indexing.
> The partitionKeys parameter would ensure that each worker would receive a 
> partition of the results returned by the topic() function. This allows the 
> classify() function to be run in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
