[
https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Bernstein updated SOLR-9240:
---------------------------------
Description:
It would be useful for Solr to support large scale *Extract, Transform and
Load* use cases with streaming expressions. Instead of using MapReduce for the
ETL, the topic expression will be used and SolrCloud will be treated like a giant
message queue filled with data to be processed.
This ticket makes two small changes to the topic() expression that make this
possible:
1) Changes the topic() behavior so it can operate in parallel.
2) Adds the initialCheckpoint parameter to the topic expression so a topic can
start pulling records from anywhere in the queue.
Daemons can then be sent to worker nodes that each work on processing a
partition of the data from the same topic. The daemon() function's natural
behavior is perfect for iteratively calling a topic until all records in the
topic have been processed.
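The flow described above can be sketched as a toy, in-memory model. This is not Solr code: ToyTopic, partition_of and run_daemon are hypothetical names, the hash-based partitioning and the "version greater than checkpoint" rule are assumptions made for illustration, and initial_checkpoint only stands in for the initialCheckpoint parameter proposed in this ticket.

```python
import zlib


def partition_of(doc_id, num_workers):
    """Map a document id to a worker with a hash that is stable across runs."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_workers


class ToyTopic:
    """Hypothetical in-memory stand-in for one worker's view of a topic."""

    def __init__(self, docs, worker_id, num_workers, initial_checkpoint=0, batch_size=2):
        # Keep only this worker's partition, ordered by version.
        self.docs = sorted(
            (d for d in docs if partition_of(d["id"], num_workers) == worker_id),
            key=lambda d: d["version"],
        )
        # Assumption: records with a version above the checkpoint are unread.
        self.checkpoint = initial_checkpoint
        self.batch_size = batch_size

    def read(self):
        """Return the next batch past the checkpoint, then advance the checkpoint."""
        batch = [d for d in self.docs if d["version"] > self.checkpoint][: self.batch_size]
        if batch:
            self.checkpoint = batch[-1]["version"]
        return batch


def run_daemon(topic):
    """Call the topic repeatedly until it returns no more records."""
    processed = []
    while True:
        batch = topic.read()
        if not batch:
            return processed
        processed.extend(d["id"] for d in batch)


# Six records in the "queue"; two workers start reading after version 3.
docs = [{"id": f"doc{i}", "version": i + 1} for i in range(6)]
workers = [ToyTopic(docs, w, 2, initial_checkpoint=3) for w in range(2)]
results = [run_daemon(t) for t in workers]
```

In this sketch each worker sees a disjoint slice of the records, and together the workers cover every record past the starting checkpoint, which is the property the parallel topic() is meant to provide.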
was:
Currently the topic() function won't run in parallel mode because each worker
needs to maintain a separate set of checkpoints. The proposed solution for this
is to append the worker ID to the topic ID, which will cause each worker to
have its own checkpoints.
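A minimal sketch of that idea, assuming an underscore separator and a simple in-memory store (both are illustrative choices, not the format used by the patch; worker_topic_id and CheckpointStore are hypothetical names):

```python
def worker_topic_id(topic_id, worker_id):
    """Derive a per-worker checkpoint key by appending the worker ID."""
    # Assumption: "_" as the separator; the actual patch may differ.
    return f"{topic_id}_{worker_id}"


class CheckpointStore:
    """Hypothetical store that keeps one checkpoint per derived topic ID."""

    def __init__(self):
        self._versions = {}

    def save(self, topic_id, worker_id, version):
        self._versions[worker_topic_id(topic_id, worker_id)] = version

    def load(self, topic_id, worker_id, default=0):
        return self._versions.get(worker_topic_id(topic_id, worker_id), default)


store = CheckpointStore()
store.save("classifyTopic", 0, 120)  # worker 0 has read up to version 120
store.save("classifyTopic", 1, 45)   # worker 1 progresses independently
```

Because the keys differ per worker, one worker saving its progress never overwrites another worker's checkpoint.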
It would be useful to support parallelizing the topic function because it will
provide a general purpose approach for processing text in parallel across
worker nodes.
For example this would allow a classify() function to be wrapped around a
topic() function to classify documents in parallel across worker nodes.
Sample syntax:
{code}
parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
{code}
The example above would send a daemon to worker nodes that would classify all
documents returned by the topic() function. The update function would send the
output of classify() to a SolrCloud collection for indexing.
The partitionKeys parameter would ensure that each worker would receive a
partition of the results returned by the topic() function. This allows the
classify() function to be run in parallel.
> Support parallel ETL with the topic expression
> ----------------------------------------------
>
> Key: SOLR-9240
> URL: https://issues.apache.org/jira/browse/SOLR-9240
> Project: Solr
> Issue Type: Improvement
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>
> It would be useful for Solr to support large scale *Extract, Transform and
> Load* use cases with streaming expressions. Instead of using MapReduce for
> the ETL, the topic expression will be used and SolrCloud will be treated like a
> giant message queue filled with data to be processed.
> This ticket makes two small changes to the topic() expression that make this
> possible:
> 1) Changes the topic() behavior so it can operate in parallel.
> 2) Adds the initialCheckpoint parameter to the topic expression so a topic
> can start pulling records from anywhere in the queue.
> Daemons can then be sent to worker nodes that each work on processing a
> partition of the data from the same topic. The daemon() function's natural
> behavior is perfect for iteratively calling a topic until all records in the
> topic have been processed.