Here are my thoughts. Last time Flink was brought up, we dug into the use-case and realized that Flink/Beam support for windowing on physical/arrival time (_hoodie_commit_time) would be valuable, which is why Flink was being proposed.
I would like to separate two aspects that I feel are intermingled here.

1) Writing datasets using Flink: Today, hoodie-spark-datasource and the deltastreamer tool both use Spark to write Hudi datasets. It would be nice if we could do this as part of a Flink job as well.

2) Querying Hudi datasets using Flink: We can build awesome streaming-style pipelines on top of Hudi, since it provides the _hoodie_commit_time arrival-time watermarks. Nick & I are trying to flesh this out more with motivating use-cases and make the case for doing this.

Now, a question for the folks driving HUDI-184: is the scope 1, 2, or both 1 & 2? My suggestion would be to tackle 1 in HUDI-184, and Nick and I can tackle 2 in parallel.

This is exciting work :). Hope we can get past the current release, jar fixes and get to this.. ha ha.

/thanks/vinoth

On Wed, Jul 31, 2019 at 6:01 AM Semantic Beeng <[email protected]> wrote:

> All,
>
> @vc and I have been mulling on this for a while and are working on some
> material to start this.
>
> But
>
> 1. We want to start with requirements, right?
>
> Last time we discussed this we asked for use cases, needs etc.
>
> Have some here
> https://cwiki.apache.org/confluence/display/HUDI/Hudi+for+Continuous+Deep+Analytics
> .
>
> Taher - any news on that example application about trade reconciliation,
> please?
>
> 2. Will push that we also drive this with proper architecture decisions to
> map the choices in a principled way.
>
> This will also help users make sense of fit with their architectures. See
> https://adr.github.io
>
> As architect consider that technology to technology integrations are bad
> idea.
>
> Reminds us of the M to N integration (point to point) in enterprise
> systems.
>
> Examples
>
> 1.
> https://github.com/alibaba/flink-ai-extended/tree/master/flink-ml-tensorflow
>
> 2. https://github.com/yahoo/TensorFlowOnSpark
>
> And now imagine Hudi hard linked to Flink.
>
> Someone trying to use both Spark and TF for ML and Flink for data
> sliding would be in a tough spot to reconcile.
>
> And surely quite a few library version conflicts too.
>
> Instead we need to seek some abstractions in between them to decouple.
>
> Hence, the more use cases and design examples you provide the better. :-)
>
> @vc - thoughts?
>
> Kind regards
>
> Nick
>
> On July 31, 2019 at 8:06 AM Vinoth Chandar <[email protected]> wrote:
>
> >>First of all, we should agree on the plan.
> +100 . this will be a very involved process.. if we can get a plan agreed
> upon, then we can start scoping the subtasks..
>
> On Wed, Jul 31, 2019 at 2:11 AM Vinay Patil <[email protected]>
> wrote:
>
> Hi Guys,
>
> Add me in this as well, missed out on this last time.
>
> Regards,
> Vinay Patil
