[jira] [Updated] (SOLR-9240) Support parallel ETL with the topic expression

Joel Bernstein (JIRA) Tue, 12 Jul 2016 08:50:46 -0700

     [ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-9240:
---------------------------------
    Description: 
It would be useful for Solr to support large scale *Extract, Transform and 
Load* use cases with streaming expressions. Instead of using MapReduce for the 
ETL, the topic expression will be used, and SolrCloud will be treated like a giant 
message queue filled with data to be processed.

This ticket makes two small changes to the topic() expression that make this 
possible:

1) Changes the topic() behavior so it can operate in parallel.
2) Adds the initialCheckpoint parameter to the topic expression so a topic can 
start pulling records from anywhere in the queue.

Daemons can then be sent to worker nodes that each work on processing a 
partition of the data from the same topic. The daemon() function's natural 
behavior is perfect for iteratively calling a topic until all records in the 
topic have been processed.
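
As a rough illustration of the loop described above, here is a toy Python 
model. It is not Solr code: ToyTopic, read(), run_daemon() and the batch size 
are hypothetical names invented for this sketch, and the checkpoint field only 
mimics the idea behind initialCheckpoint. A real daemon() runs on an interval 
inside Solr, while this sketch simply stops once the topic returns nothing new.

```python
from dataclasses import dataclass


@dataclass
class ToyTopic:
    """Toy stand-in for a topic: versioned records plus a checkpoint."""
    records: list        # (version, doc) tuples, sorted by version
    checkpoint: int = 0  # hypothetical analogue of initialCheckpoint

    def read(self, batch_size=2):
        # Return up to batch_size records newer than the checkpoint,
        # then advance the checkpoint past the last record returned.
        batch = [r for r in self.records if r[0] > self.checkpoint][:batch_size]
        if batch:
            self.checkpoint = batch[-1][0]
        return batch


def run_daemon(topic, process):
    """Call the topic repeatedly until it has no more records to return."""
    processed = []
    while True:
        batch = topic.read()
        if not batch:
            break
        processed.extend(process(doc) for _, doc in batch)
    return processed
```

Starting the toy topic with checkpoint=2 skips the first two versions, which 
is the "start pulling records from anywhere in the queue" behavior in 
miniature.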





  was:
Currently the topic() function won't run in parallel mode because each worker 
needs to maintain a separate set of checkpoints. The proposed solution for this 
is to append the worker ID to the topic ID, which will cause each worker to 
have its own checkpoints.
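
A minimal Python sketch of that idea, under stated assumptions: the ticket 
only says the worker ID is appended to the topic ID, so the underscore 
separator, the in-memory dictionary, and the function names below are all 
hypothetical and do not reflect how Solr stores checkpoints.

```python
def worker_topic_id(topic_id, worker_id):
    # Hypothetical naming scheme: the separator is an assumption.
    return f"{topic_id}_{worker_id}"


# Toy checkpoint store keyed by the per-worker topic ID.
checkpoints = {}


def save_checkpoint(topic_id, worker_id, version):
    checkpoints[worker_topic_id(topic_id, worker_id)] = version


def load_checkpoint(topic_id, worker_id, default=0):
    # A worker that has never saved a checkpoint starts from the default.
    return checkpoints.get(worker_topic_id(topic_id, worker_id), default)
```

Because each worker reads and writes under its own key, one worker advancing 
its checkpoint does not affect the position of any other worker.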

It would be useful to support parallelizing the topic function because it will 
provide a general purpose approach for processing text in parallel across 
worker nodes.

For example, this would allow a classify() function to be wrapped around a 
topic() function to classify documents in parallel across worker nodes. 

Sample syntax:

{code}
parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
{code}

The example above would send a daemon to worker nodes that would classify all 
documents returned by the topic() function. The update function would send the 
output of classify() to a SolrCloud collection for indexing.

The partitionKeys parameter would ensure that each worker would receive a 
partition of the results returned by the topic() function. This allows the 
classify() function to be run in parallel.
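
The partitioning step can be pictured with a toy hash partitioner in Python. 
This is only a sketch of the general technique: the CRC32 hash, the "|" key 
separator, and the names worker_for() and partition() are assumptions made 
for illustration, not the hashing Solr actually uses for partitionKeys.

```python
import zlib


def worker_for(doc, partition_keys, num_workers):
    # Build a key from the partition fields and hash it to a worker index.
    key = "|".join(str(doc[k]) for k in partition_keys)
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(docs, partition_keys, num_workers):
    # Assign every document to exactly one worker's partition.
    parts = [[] for _ in range(num_workers)]
    for doc in docs:
        parts[worker_for(doc, partition_keys, num_workers)].append(doc)
    return parts
```

Since the worker index depends only on the key fields, the same document 
always lands on the same worker, and no document is handed to two workers.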







> Support parallel ETL with the topic expression
> ----------------------------------------------
>
>                 Key: SOLR-9240
>                 URL: https://issues.apache.org/jira/browse/SOLR-9240
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>
> It would be useful for Solr to support large scale *Extract, Transform and 
> Load* use cases with streaming expressions. Instead of using MapReduce for 
> the ETL, the topic expression will be used, and SolrCloud will be treated like a 
> giant message queue filled with data to be processed.
> This ticket makes two small changes to the topic() expression that make this 
> possible:
> 1) Changes the topic() behavior so it can operate in parallel.
> 2) Adds the initialCheckpoint parameter to the topic expression so a topic 
> can start pulling records from anywhere in the queue.
> Daemons can then be sent to worker nodes that each work on processing a 
> partition of the data from the same topic. The daemon() function's natural 
> behavior is perfect for iteratively calling a topic until all records in the 
> topic have been processed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

