JeffBolle opened a new issue, #20890: URL: https://github.com/apache/pulsar/issues/20890
### Search before asking

- [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Motivation

ElasticSearch and Pulsar are very different types of data storage systems. However, there may be occasions where large amounts of data previously stored in ElasticSearch need to be read into a Pulsar topic. We are working on transitioning a large data storage and search system, with tens of billions of records and hundreds of TB of data, that would better serve the organization in Pulsar. There are likely other use cases where data is being ingested into ElasticSearch but also needs to be loaded into Pulsar at scale.

The goal is not change detection, but simply to run a query and stream the results to a Pulsar topic in a highly parallel way, so that very large datasets can be copied to Pulsar topics.

### Solution

Implement a BatchSource that leverages the existing ElasticSearchSink code to build an ElasticSearch client, runs a search that can be parallelized, and sends the results to a Pulsar topic.

The motivation for using BatchSource is to use the `discover` phase to create and distribute the search slices, so that the user can run a sliced search in parallel across multiple function runners. The user will be able to choose between a Scroll search and a PIT (point-in-time) search, depending on what their ElasticSearch version supports.

### Alternatives

_No response_

### Anything else?

I'm working on the PR implementing this feature.

### Are you willing to submit a PR?

- [X] I'm willing to submit a PR!
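
To illustrate the slice distribution described under Solution, here is a minimal sketch of what the `discover` phase might emit. It is not the code from the planned PR. The class name `SlicedSearchPlanner`, the helper `sliceTask`, and the choice to encode each task as a JSON search body are all hypothetical. The `Consumer<byte[]>` parameter mirrors the task callback that `BatchSource.discover` receives, and the `slice` clause follows ElasticSearch's sliced search syntax (`id` and `max`). The sketch uses only the JDK, so it runs without Pulsar or ElasticSearch on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: plans one search task per slice, the way a
// BatchSource discover phase could hand work to function runners.
class SlicedSearchPlanner {

    // Builds the search body for one slice by wrapping the user's query.
    // The task encoding (a plain JSON string) is an assumption.
    static String sliceTask(String queryJson, int sliceId, int maxSlices) {
        if (maxSlices < 2) {
            // With a single task there is nothing to partition,
            // so the slice clause is omitted.
            return "{\"query\":" + queryJson + "}";
        }
        return "{\"slice\":{\"id\":" + sliceId + ",\"max\":" + maxSlices + "},"
                + "\"query\":" + queryJson + "}";
    }

    // Emits one task per slice to the callback, mirroring the shape of
    // BatchSource.discover(Consumer<byte[]> taskEater).
    static void discover(String queryJson, int maxSlices, Consumer<byte[]> taskEater) {
        int n = Math.max(1, maxSlices);
        for (int id = 0; id < n; id++) {
            taskEater.accept(sliceTask(queryJson, id, n).getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        List<String> tasks = new ArrayList<>();
        discover("{\"match_all\":{}}", 3,
                t -> tasks.add(new String(t, StandardCharsets.UTF_8)));
        tasks.forEach(System.out::println);
    }
}
```

In a real source, each runner would receive one of these tasks in `prepare`, open a Scroll or PIT search with that body, and return hits from `readNext` until its slice is exhausted.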
