[ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joel Bernstein updated SOLR-9240:
---------------------------------
Description:
Currently the topic() function won't run in parallel mode because each worker
needs to maintain a separate set of checkpoints. The proposed solution for this
is to append the worker ID to the topic ID, which will cause each worker to
have its own checkpoints.
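A minimal sketch of the proposed per-worker checkpoint idea, in Python for illustration only. The `worker_topic_id` and `save_checkpoint` helpers, the underscore separator, and the in-memory `checkpoints` store are all hypothetical; the actual naming scheme and storage are not specified here.

```python
# Hypothetical in-memory checkpoint store: per-worker topic ID -> {shard: version}.
checkpoints = {}


def worker_topic_id(topic_id, worker_id):
    """Append the worker ID to the topic ID (separator is an assumption)."""
    return f"{topic_id}_{worker_id}"


def save_checkpoint(topic_id, worker_id, shard, version):
    """Record a checkpoint under the worker-specific topic ID."""
    key = worker_topic_id(topic_id, worker_id)
    checkpoints.setdefault(key, {})[shard] = version


# Two workers running the same topic keep independent checkpoints.
save_checkpoint("topic1", 0, "shard1", 100)
save_checkpoint("topic1", 1, "shard1", 250)
```

Because each worker writes under a distinct key, one worker advancing its checkpoint does not overwrite another's.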
It would be useful to support parallelizing the topic() function because it
will provide a general-purpose approach for processing text in parallel across
worker nodes.
For example, this would allow a classify() function to be wrapped around a
topic() function to classify documents in parallel across worker nodes.
Sample syntax:
{code}
parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
{code}
The example above would send a daemon to worker nodes that would classify all
documents returned by the topic() function. The update function would send the
output of classify() to a SolrCloud collection for indexing.
The partitionKeys parameter would ensure that each worker would receive a
partition of the results returned by the topic() function. This allows the
classify() function to be run in parallel.
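The partitioning described above can be sketched roughly as follows, in Python for illustration only. The `assign_worker` and `partition` helpers are hypothetical, and CRC32 is used purely as a stand-in hash; the hash function Solr actually applies to partitionKeys is not stated here.

```python
import zlib


def assign_worker(key_value, num_workers):
    """Map a partition key value to a worker index (CRC32 is a stand-in hash)."""
    return zlib.crc32(key_value.encode("utf-8")) % num_workers


def partition(docs, partition_key, num_workers):
    """Split docs into one list per worker, based on the partition key field."""
    parts = [[] for _ in range(num_workers)]
    for doc in docs:
        worker = assign_worker(str(doc[partition_key]), num_workers)
        parts[worker].append(doc)
    return parts


# Each document lands on exactly one of the four workers.
docs = [{"id": str(i)} for i in range(20)]
parts = partition(docs, "id", 4)
```

Since the assignment depends only on the key value and the worker count, every document goes to exactly one worker, and the same document always goes to the same worker.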
was:
Currently the topic() function doesn't accept a partitionKeys parameter like
the search() function does. This means the topic() function can't be wrapped by
the parallel() function to run across worker nodes.
It would be useful to support parallelizing the topic function because it would
provide a general purpose parallelized approach for processing batches of data
as they enter the index.
For example this would allow a classify() function to be wrapped around a
topic() function to classify documents in parallel across worker nodes.
Sample syntax:
{code}
parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
{code}
The example above would send a daemon to worker nodes that would classify all
new documents returned by the topic() function. The update function would send
the output of classify() to a SolrCloud collection for indexing.
The partitionKeys parameter would ensure that each worker would receive a
partition of the results returned by the topic() function. This allows the
classify() function to be run in parallel.
> Support running the topic() Streaming Expression in parallel mode.
> ------------------------------------------------------------------
>
> Key: SOLR-9240
> URL: https://issues.apache.org/jira/browse/SOLR-9240
> Project: Solr
> Issue Type: Improvement
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
>
> Currently the topic() function won't run in parallel mode because each worker
> needs to maintain a separate set of checkpoints. The proposed solution for
> this is to append the worker ID to the topic ID, which will cause each worker
> to have its own checkpoints.
> It would be useful to support parallelizing the topic() function because it
> will provide a general-purpose approach for processing text in parallel
> across worker nodes.
> For example, this would allow a classify() function to be wrapped around a
> topic() function to classify documents in parallel across worker nodes.
> Sample syntax:
> {code}
> parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
> {code}
> The example above would send a daemon to worker nodes that would classify all
> documents returned by the topic() function. The update function would send
> the output of classify() to a SolrCloud collection for indexing.
> The partitionKeys parameter would ensure that each worker would receive a
> partition of the results returned by the topic() function. This allows the
> classify() function to be run in parallel.