[ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Bernstein updated SOLR-9240:
---------------------------------
    Description: 
It would be useful for SolrCloud to support large scale *Extract, Transform and 
Load* workloads with streaming expressions. Instead of using MapReduce for 
ETL, the topic expression can be used, which allows SolrCloud to be treated like 
a distributed message queue filled with data to be processed. The topic 
expression works in batches and supports retrieval of stored fields, so large 
scale *text ETL* works well with this approach.

This ticket makes two small changes to the topic() expression that make 
parallel ETL possible:

1) Changes the topic expression so it can operate in parallel.
2) Adds the initialCheckpoint parameter to the topic expression so a topic can 
start pulling records from anywhere in the queue.

Daemons can be sent to worker nodes, each processing a partition of the data 
from the same topic. The daemon() function's natural behavior is perfect for 
iteratively calling a topic until all records in the topic have been processed.

The sample code below pulls all records from one collection and indexes them 
into another collection. A transform function could be wrapped around the 
topic() to transform the records before loading, and custom functions can also 
be built to load the data in parallel into any outside system.

{code}

parallel(
         workerCollection, 
         workers="2", 
         sort="DaemonOp desc", 
         daemon(
                update(
                      updateCollection, 
                      batchSize=200, 
                      topic(
                          checkpointCollection,
                          topicCollection, 
                          q="*:*", 
                          id="topic1",
                          fl="id, to, from, body", 
                          partitionKeys="id",
                          initialCheckpoint="0")), 
                runInterval="1000", 
                id="daemon1"))
{code}
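The checkpointing that makes this safe to run in parallel can be sketched outside of Solr. The toy Python below (a rough model with hypothetical helper names, not Solr code) shows each worker repeatedly pulling a batch of records newer than its stored checkpoint, filtered to its own partition as partitionKeys="id" would do, then advancing the checkpoint so a restart resumes rather than reprocesses:

```python
def partition_of(record_id, num_workers):
    # Stand-in for the hash partitioning Solr applies via partitionKeys
    return hash(record_id) % num_workers

def run_topic_worker(records, worker, num_workers, batch_size, checkpoints):
    """Drain one partition of a topic, batch by batch.

    records: list of (version, id) tuples sorted by version, playing the
    role of _version_-ordered documents in a Solr collection.
    checkpoints: mutable dict of worker -> last version processed, playing
    the role of the checkpointCollection.
    """
    processed = []
    while True:
        batch = [r for r in records
                 if r[0] > checkpoints[worker]
                 and partition_of(r[1], num_workers) == worker][:batch_size]
        if not batch:
            break  # topic drained; daemon() would simply keep polling
        processed.extend(batch)
        # Persist the highest version seen so far, as the checkpoint
        # collection does, so a restarted worker resumes where it left off.
        checkpoints[worker] = batch[-1][0]
    return processed
```

Starting two workers with checkpoints of 0 (the initialCheckpoint="0" case) lets them process every record exactly once between them, since the partition filter keeps their batches disjoint.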





> Support parallel ETL with the topic expression
> ----------------------------------------------
>
>                 Key: SOLR-9240
>                 URL: https://issues.apache.org/jira/browse/SOLR-9240
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.2
>
>         Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
