Hi Taher,

IMO, let's listen for more comments; after all, this discussion took place
over the weekend. Then we can take Vinoth's and the community's comments
and suggestions into account.
I personally think that the design is more important. When we have a clear
idea, it is not too late to create an issue. I am sorting out the classes
that depend on Spark; maybe we can then discuss how to decouple them. What
do you think?

Best,
Vino

taher koitawala <taher...@gmail.com> wrote on Mon, Aug 5, 2019 at 2:17 PM:

> If everyone agrees that we should decouple Hudi and Spark to enable
> processing-engine abstraction, should I open a JIRA ticket for that?
>
> On Sun, Aug 4, 2019 at 6:59 PM taher koitawala <taher...@gmail.com> wrote:
>
>> If anyone wants to see a Flink streaming pipeline, here is a really
>> small and basic one:
>> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
>>
>> Consider users playing a game across multiple platforms, where we only
>> get the timestamp, username, and current score as the record. The
>> pipeline has a custom source function which produces this stream record.
>>
>> The pipeline does aggregations (summing the score of the current window
>> with the total score of the user) every 2 seconds, based on the event
>> time attached to the record.
>>
>> A user's score keeps increasing as new windows fire and new outputs are
>> emitted. That's where Hudi fits, as per my vision now: Hudi
>> intelligently shows only the latest records written.
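As a rough illustration of the pipeline Taher describes (not the code from
the linked repository), a minimal Flink DataStream sketch might look like
the following. ScoreEvent and ScoreSource are hypothetical names, and the
eventual Hudi write step is omitted, since that integration is exactly what
this thread is discussing:

    import java.util.Random;

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class GameScorePipeline {

        /** The record shape from the thread: timestamp, username, current score. */
        public static class ScoreEvent {
            public long timestamp;   // event time, epoch millis
            public String username;
            public long score;

            public ScoreEvent() {}

            public ScoreEvent(long timestamp, String username, long score) {
                this.timestamp = timestamp;
                this.username = username;
                this.score = score;
            }
        }

        /** Hypothetical stand-in for the custom source in the linked example. */
        public static class ScoreSource implements SourceFunction<ScoreEvent> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<ScoreEvent> ctx) throws Exception {
                String[] users = {"alice", "bob"};
                Random rnd = new Random();
                while (running) {
                    ctx.collect(new ScoreEvent(
                            System.currentTimeMillis(),
                            users[rnd.nextInt(users.length)],
                            rnd.nextInt(10)));
                    Thread.sleep(100);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

            DataStream<ScoreEvent> events = env
                    .addSource(new ScoreSource())
                    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<ScoreEvent>() {
                        @Override
                        public long extractAscendingTimestamp(ScoreEvent e) {
                            return e.timestamp;
                        }
                    });

            // Per-user score summed over 2-second event-time windows. A running
            // total across windows would keep keyed state; the Hudi write step
            // is omitted, since that integration is what this thread is about.
            events.keyBy(e -> e.username)
                  .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                  .sum("score")
                  .print();

            env.execute("game-score-pipeline");
        }
    }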
>> On Sun, Aug 4, 2019, 6:43 PM taher koitawala <taher...@gmail.com> wrote:
>>
>>> Fully agreed with Vino. I think let's chalk out the classes, make
>>> hierarchies, and start decoupling everything. Then we can move forward
>>> with the Flink and Beam streaming components.
>>>
>>> On Sun, Aug 4, 2019, 1:52 PM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Thank you for your more detailed thoughts. I fully agree with your
>>>> thoughts about HudiLink, which should also be part of the long-term
>>>> planning of the Hudi ecology.
>>>>
>>>> *But I found that our angles of thinking and starting points are not
>>>> consistent. I pay more attention to the rationality of the existing
>>>> architecture and whether the dependence on the computing engine is
>>>> pluggable. Don't get me wrong: I know very well that although we have
>>>> different perspectives, both views have value for Hudi.*
>>>>
>>>> Let me give more details on the discussion I raised earlier.
>>>>
>>>> Currently, multiple submodules of the Hudi project are tightly
>>>> coupled to Spark's design and dependencies. You can see that many of
>>>> the class files contain statements such as "import
>>>> org.apache.spark.xxx".
>>>>
>>>> I first put forward one discussion, "Integrate Hudi with Apache
>>>> Flink", and then came up with another: "Decouple Hudi and Spark".
>>>>
>>>> The word "integrate" I used in the first discussion may not be
>>>> accurate enough. My intention is to make the computing engine used by
>>>> Hudi pluggable. To Hudi, Spark is just a library; it is not the core
>>>> of Hudi and should not be strongly coupled with it. The features
>>>> currently provided by Spark are also available from Flink. But in
>>>> order to achieve this, we need to decouple Hudi's code from its use
>>>> of Spark.
>>>>
>>>> This makes sense both in terms of architectural rationality and
>>>> community ecology.
>>>>
>>>> Best,
>>>> Vino
>>>>
>>>> Semantic Beeng <n...@semanticbeeng.com> wrote on Sun, Aug 4, 2019 at
>>>> 2:21 PM:
>>>>
>>>>> "+1 for both Beam and Flink" - what I propose implies this indeed.
>>>>>
>>>>> But/and I am working from the desired functionality and a proposed
>>>>> design (as opposed to starting with refactoring Hudi with the goal
>>>>> of close integration with Flink).
>>>>>
>>>>> I feel this is not necessary - but I am not an expert in the Hudi
>>>>> implementation.
>>>>>
>>>>> But I am pretty sure it is not sufficient for the use cases I have
>>>>> in mind. The gist is using Hudi as a file-based data lake + ML
>>>>> feature store that enables incremental analyses done with a
>>>>> combination of Flink, Beam, Spark, and TensorFlow (see Petastorm
>>>>> from Uber Engineering for an idea).
>>>>>
>>>>> Let us call this HudiLink from now on (think of it as a mediator,
>>>>> not another Hudi).
>>>>>
>>>>> The intuition behind looking at more than Flink is that both Beam
>>>>> and Flink have good design abstractions we might reuse and extend.
>>>>>
>>>>> Like I said before, I do not believe in point-to-point integrations.
>>>>>
>>>>> Alternatively / in parallel, if you care to share your use cases,
>>>>> that would be very useful. Working with explicit use cases helps
>>>>> others relate and help.
>>>>>
>>>>> Also, if some of you believe in (see) value in refactoring the Hudi
>>>>> implementation for a hard integration with Flink (but have no time
>>>>> to argue for it), of course please go ahead.
>>>>>
>>>>> That may be a valid bottom-up approach, but I cannot relate to it
>>>>> myself (due to lack of use cases).
>>>>>
>>>>> I am working on material about HudiLink - if any of you are
>>>>> interested, I might publish it when it is more mature.
>>>>>
>>>>> Hint: this was part of the inspiration:
>>>>> https://eng.uber.com/michelangelo/
>>>>>
>>>>> One well-thought-out use case will get you "in". :-) Kidding, of
>>>>> course.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Nick
>>>>>
>>>>> On August 3, 2019 at 10:55 PM vino yang <yanghua1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> +1 for both Beam and Flink
>>>>>
>>>>> First step here is probably to draw out the current hierarchy and
>>>>> figure out what the abstraction points are.
>>>>> In my opinion, the runtime (Spark, Flink) should be done at the
>>>>> hoodie-client level and just used by hoodie-utilities seamlessly.
>>>>>
>>>>> +1 for Vinoth's opinion; it should be the first step.
>>>>>
>>>>> No matter which computing framework we hope Hudi to integrate with,
>>>>> we need to decouple the Hudi client and Spark.
>>>>>
>>>>> We may need a pure client module named, for example,
>>>>> hoodie-client-core (common).
>>>>>
>>>>> Then we could have: hoodie-client-spark, hoodie-client-flink and
>>>>> hoodie-client-beam.
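To make that layering concrete, here is a minimal sketch of the seam that a
hoodie-client-core module could expose. Every name below is hypothetical
(none of these types exist in Hudi), and the toy in-memory class only
stands in for what hoodie-client-spark or hoodie-client-flink would
implement on top of their native collections:

    import java.util.List;
    import java.util.UUID;

    /**
     * Hypothetical engine-agnostic API for a hoodie-client-core module.
     * I is the engine's native record collection, e.g.
     * JavaRDD<HoodieRecord> in hoodie-client-spark,
     * DataStream<HoodieRecord> in hoodie-client-flink,
     * PCollection<HoodieRecord> in hoodie-client-beam.
     */
    interface HoodieEngineClient<I> {
        String startCommit();
        I upsert(I records, String commitTime);
        boolean commit(String commitTime, I writeStatuses);
    }

    /** Toy in-memory implementation, standing in for an engine module. */
    class InMemoryEngineClient implements HoodieEngineClient<List<String>> {
        @Override
        public String startCommit() {
            return UUID.randomUUID().toString(); // real impl: new instant on the timeline
        }

        @Override
        public List<String> upsert(List<String> records, String commitTime) {
            return records; // real impl: index lookup, then insert/update file groups
        }

        @Override
        public boolean commit(String commitTime, List<String> writeStatuses) {
            return true; // real impl: finalize the instant after checking write statuses
        }
    }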
>>>>> Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at
>>>>> 10:45 AM:
>>>>>
>>>>> +1 for Beam -- agree with Semantic Beeng's analysis.
>>>>>
>>>>> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> So the way to go about this is to file a HIP, chalk all the classes
>>>>> out, and start moving towards a pure client.
>>>>>
>>>>> Secondly, do we want to try Beam?
>>>>>
>>>>> I think there is too much going on here and I'm not able to follow.
>>>>> If we want to try out Beam all along, I don't think it makes sense
>>>>> to do anything on Flink.
>>>>>
>>>>> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
>>>>> wrote:
>>>>>
>>>>> >> +1 My money is on this approach.
>>>>> >>
>>>>> >> The existing abstractions from Beam seem enough for the use cases
>>>>> >> as I imagine them.
>>>>> >>
>>>>> >> Flink also has "dynamic table", "table source" and "table sink",
>>>>> >> which seem very useful abstractions where Hudi might fit nicely.
>>>>> >>
>>>>> >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>>>>> >>
>>>>> >> Attached a screenshot.
>>>>> >>
>>>>> >> This seems to fit with the original premise of Hudi as well.
>>>>> >>
>>>>> >> I am exploring this avenue with a use case that involves
>>>>> >> "temporal joins on streams", which I need for feature extraction.
>>>>> >>
>>>>> >> If anyone is interested in this or has concrete enough needs and
>>>>> >> use cases, please let me know.
>>>>> >>
>>>>> >> It is best to go from an agreed-upon set of 2-3 use cases.
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> Nick
>>>>> >>
>>>>> >> > Also, we do have some Beam experts on the mailing list.. Can
>>>>> >> > you please weigh in on the viability of using Beam as the
>>>>> >> > intermediate abstraction here between Spark/Flink? Hudi uses
>>>>> >> > RDD APIs like groupBy, mapToPair, sortAndRepartition,
>>>>> >> > reduceByKey, countByKey, and also does custom partitioning a
>>>>> >> > lot.
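For reference on that question: most of the listed RDD operations have
direct Beam counterparts, sketched below on toy data (class and transform
labels are illustrative, not from Hudi). The harder fit is the
repartition-and-sort and custom partitioning, since Beam deliberately
leaves physical partitioning to the runner:

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamRddEquivalents {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Toy "user:score" records.
            PCollection<String> records =
                    p.apply(Create.of(Arrays.asList("a:1", "b:2", "a:3")));

            // mapToPair -> MapElements into KV
            PCollection<KV<String, Long>> pairs = records.apply(
                    MapElements.into(TypeDescriptors.kvs(
                                    TypeDescriptors.strings(), TypeDescriptors.longs()))
                            .via((String s) -> KV.of(s.split(":")[0],
                                    Long.parseLong(s.split(":")[1]))));

            // reduceByKey -> a Combine.perKey, e.g. Sum
            pairs.apply("SumPerKey", Sum.longsPerKey());

            // countByKey -> Count.perKey
            pairs.apply("CountPerKey", Count.perKey());

            // groupBy -> GroupByKey
            pairs.apply("GroupPerKey", GroupByKey.create());

            // No direct counterpart for sortAndRepartition / custom
            // partitioning: Beam hides physical partitioning from user code.
            p.run().waitUntilFinish();
        }
    }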