Hi Thomas,

Good point. I have pretty similar thoughts. The current approach is: once you get your data into some staging area (Kafka, files on DFS), DeltaStreamer can ingest it incrementally. For example, the page below uses Sqoop for the first leg and then DeltaStreamer/DataSource:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=128651008
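To make the two-leg flow concrete, here is a rough sketch (paths, table names, and exact flag spellings are illustrative, not verified against a particular Hudi release; check `HoodieDeltaStreamer --help` for the authoritative options):

```
# First leg: bulk import from the upstream DB into DFS via Sqoop
sqoop import --connect jdbc:mysql://dbhost/appdb --table users \
  --as-parquetfile --target-dir /staging/users

# Second leg: DeltaStreamer incrementally ingests the staged files into Hudi
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field updated_at \
  --target-base-path /data/hudi/users \
  --target-table users \
  --props /path/to/dfs-source.properties
```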
Recent discussions have been about supporting new such intermediate staging areas: Pulsar, Kinesis. Plus a JDBC-based native connector that makes the Sqoop + DeltaStreamer approach a single step, using the existing Spark JDBC connector. That's a good direction to pursue IMO, since we can potentially reuse any of the Spark data sources out there, as long as we figure out how to tail that source. Thoughts?
https://spark-packages.org/?q=tags%3A%22data+source%22

If we can push on adding and certifying support for more and more of these connectors over time, that would be an awesome 😎 contribution to Hudi.

Thanks
Vinoth

On Sun, Sep 22, 2019 at 8:13 PM Thomas Weise <[email protected]> wrote:

> Hey,
>
> Seeing the discussions about adding new connectors to Hudi makes me wonder
> if it would be possible to bridge existing connectors and their upstream
> projects with a sink that produces the input expected by Hudi instead.
>
> Reason I'm bringing this up is that it takes significant effort and usage
> in a sufficiently diverse set of use cases to build good connectors. If
> such investment of other communities could be leveraged, it might help
> Hudi.
>
> Thanks,
> Thomas
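P.S. A toy sketch of the "tailing a JDBC source" idea, in case it helps the discussion. This uses sqlite3 as a stand-in for the real JDBC source, and the function and column names (`incremental_pull`, `updated_at`) are purely hypothetical; a real implementation would sit behind a DeltaStreamer Source and persist the checkpoint in the Hudi commit metadata:

```python
import sqlite3

def incremental_pull(conn, table, ordering_field, last_checkpoint):
    """Fetch only rows newer than the last checkpoint, mimicking how a
    JDBC source could be tailed between successive DeltaStreamer runs."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {ordering_field} > ? "
        f"ORDER BY {ordering_field}",
        (last_checkpoint,),
    )
    rows = cur.fetchall()
    # New checkpoint = highest ordering value seen; carried to the next run.
    new_checkpoint = rows[-1][-1] if rows else last_checkpoint
    return rows, new_checkpoint

# Simulate an upstream table that keeps receiving writes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "a", 100), (2, "b", 200), (3, "c", 300)])

# First run pulls everything; second run pulls only the new row.
batch1, ckpt = incremental_pull(conn, "users", "updated_at", 0)
conn.execute("INSERT INTO users VALUES (4, 'd', 400)")
batch2, ckpt = incremental_pull(conn, "users", "updated_at", ckpt)
```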
