Hi Thomas,

Good point. I have pretty similar thoughts. The current approach is: once you
get your data into some staging area (Kafka, files on DFS), DeltaStreamer can
ingest it incrementally. For example, the wiki page here uses Sqoop for the
first leg and then DeltaStreamer/DataSource:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=128651008

Recent discussions have been about supporting new intermediate staging
areas such as Pulsar and Kinesis.

Plus a JDBC-based native connector that makes the Sqoop + DeltaStreamer
approach a single step, using the existing Spark JDBC connector. That's a good
direction to pursue IMO, since we can potentially reuse any of the Spark
datasources out there as long as we figure out how to tail that source.
Thoughts?

https://spark-packages.org/?q=tags%3A%22data+source%22
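
To make the idea concrete, here's a rough PySpark sketch of what a single-step
JDBC ingest could look like, with Spark's built-in JDBC source replacing the
Sqoop first leg. The connection details are made up, and the hoodie.* keys are
just the usual DataSource write configs; this isn't a concrete API proposal,
only a sketch of the shape of it:

```python
# Hypothetical single-step ingest: Spark JDBC source -> Hudi DataSource write.
# All connection values below are illustrative placeholders.
jdbc_options = {
    "url": "jdbc:mysql://db-host:3306/sales",  # hypothetical source database
    "dbtable": "orders",
    "user": "hudi",
    "password": "<redacted>",
}

# Standard Hudi DataSource write configs (record key, precombine, upsert).
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# With a SparkSession `spark` in scope, the two legs collapse into one job:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("/data/hudi/orders"))
```

The piece we'd still need to figure out is the "tailing" part — e.g. tracking a
checkpoint column like updated_at so each run only pulls new rows.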

If we can push on adding and certifying support for more and more of these
connectors over time, that would be an awesome 😎 contribution to Hudi.


Thanks
Vinoth


On Sun, Sep 22, 2019 at 8:13 PM Thomas Weise <[email protected]> wrote:

> Hey,
>
> Seeing the discussions about adding new connectors to Hudi makes me wonder
> if it would be possible to bridge existing connectors and their upstream
> projects with a sink that produces the input expected by Hudi instead.
>
> The reason I'm bringing this up is that it takes significant effort and usage
> in a sufficiently diverse set of use cases to build good connectors. If
> such investment by other communities could be leveraged, it might help
> Hudi.
>
> Thanks,
> Thomas
>
