[GitHub] [pulsar] JeffBolle opened a new issue, #20890: Add ElasticSearch BatchSource

via GitHub Wed, 26 Jul 2023 19:40:49 -0700


JeffBolle opened a new issue, #20890:
URL: https://github.com/apache/pulsar/issues/20890


   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   ElasticSearch and Pulsar are very different types of data storage systems. 
However, there may be occasions where a large amount of data previously 
stored in ElasticSearch needs to be read into a Pulsar topic. We are 
working on transitioning a large data storage and search system, with tens 
of billions of records and hundreds of TB of data, that would better serve the 
organization in Pulsar.  There are likely other use cases where data is being 
ingested into ElasticSearch but also needs to be loaded into Pulsar at scale.  
The goal of this is not change detection, but simply to run a query and stream 
the results to a topic in Pulsar in a highly parallel way that will allow for 
copying very large datasets to Pulsar topics.  
   
   ### Solution
   
   Implement a BatchSource that leverages the existing 
ElasticSearchSink code to build an ElasticSearch client, runs a search that 
can be parallelized, and sends the results to a Pulsar topic.  The motivation 
behind using the BatchSource is to use the `discover` phase to create and 
distribute the search slices, allowing the user to create a sliced search that 
is executed in parallel by multiple function runners. The user will be able to 
choose between a Scroll and a point-in-time (PIT) search, depending on what 
their ElasticSearch version supports.
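
   To illustrate the `discover` idea, here is a minimal sketch of how the 
search slices could be generated and handed out as tasks. It is written in 
Python only to keep it short; the actual connector would be Java. The names 
`build_slice_tasks`, `discover`, and `task_eater` are hypothetical, and the 
request body assumes ElasticSearch's sliced scroll syntax (`slice.id` / 
`slice.max`). It does not contact a cluster, and it is not the code from the 
planned PR.

```python
import json
from typing import Callable, Dict, List


def build_slice_tasks(query: Dict, num_slices: int) -> List[bytes]:
    """Build one serialized search body per slice (hypothetical helper)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    tasks = []
    for slice_id in range(num_slices):
        body = {"query": query}
        # A single slice is just an unsliced search, so the clause is omitted.
        if num_slices > 1:
            body["slice"] = {"id": slice_id, "max": num_slices}
        # Tasks are serialized to bytes so they can be distributed to workers.
        tasks.append(json.dumps(body, sort_keys=True).encode("utf-8"))
    return tasks


def discover(task_eater: Callable[[bytes], None], query: Dict, num_slices: int) -> None:
    """Emit every slice task to the callback, one task per parallel reader."""
    for task in build_slice_tasks(query, num_slices):
        task_eater(task)
```

   Each instance that receives a task would then run its own slice of the 
search and publish the hits, so the degree of parallelism is bounded by the 
number of slices chosen by the user.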
   
   ### Alternatives
   
   _No response_
   
   ### Anything else?
   
   I'm working on the PR implementing this feature.
   
   ### Are you willing to submit a PR?
   
   - [X] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
