Hi Taher,

IMO, let's listen for more comments; after all, this discussion took place
over the weekend. Then we can take Vinoth's and the community's comments
and suggestions into account.
I personally think that the design is more important. When we have a clear
idea, it is not too late to create an issue. I am sorting out the classes
that depend on Spark; maybe we can then discuss how to decouple them. What
do you think?

Best,
Vino

taher koitawala <taher...@gmail.com> wrote on Mon, Aug 5, 2019 at 2:17 PM:

> If everyone agrees that we should decouple Hudi and Spark to enable
> processing-engine abstraction, should I open a JIRA ticket for that?
>
> On Sun, Aug 4, 2019 at 6:59 PM taher koitawala <taher...@gmail.com> wrote:
>
>> If anyone wants to see a Flink streaming pipeline, here is a really
>> small and basic one:
>> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
>>
>> Consider users playing a game across multiple platforms, where we only
>> get the timestamp, username, and current score as the record. The
>> pipeline has a custom source function which produces this stream record.
>>
>> The pipeline does aggregations (summing the score of the current window
>> with the total score of the user) every 2 seconds, based on the event
>> time attached to the record.
>>
>> A user's score keeps increasing as new windows fire and new outputs are
>> emitted. That's where Hudi fits, as per my vision now: Hudi
>> intelligently shows only the latest records written.
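As a rough illustration of the pipeline Taher describes (not the code from
the linked repository), a minimal Flink DataStream sketch might look like
the following. ScoreEvent and ScoreSource are hypothetical names, and the
eventual Hudi write step is omitted, since that integration is exactly what
this thread is discussing:

    import java.util.Random;

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class GameScorePipeline {

        /** The record shape from the thread: timestamp, username, current score. */
        public static class ScoreEvent {
            public long timestamp;   // event time, epoch millis
            public String username;
            public long score;

            public ScoreEvent() {}

            public ScoreEvent(long timestamp, String username, long score) {
                this.timestamp = timestamp;
                this.username = username;
                this.score = score;
            }
        }

        /** Hypothetical stand-in for the custom source in the linked example. */
        public static class ScoreSource implements SourceFunction<ScoreEvent> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<ScoreEvent> ctx) throws Exception {
                String[] users = {"alice", "bob"};
                Random rnd = new Random();
                while (running) {
                    ctx.collect(new ScoreEvent(
                            System.currentTimeMillis(),
                            users[rnd.nextInt(users.length)],
                            rnd.nextInt(10)));
                    Thread.sleep(100);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

            DataStream<ScoreEvent> events = env
                    .addSource(new ScoreSource())
                    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<ScoreEvent>() {
                        @Override
                        public long extractAscendingTimestamp(ScoreEvent e) {
                            return e.timestamp;
                        }
                    });

            // Per-user score summed over 2-second event-time windows. A running
            // total across windows would keep keyed state; the Hudi write step
            // is omitted, since that integration is what this thread is about.
            events.keyBy(e -> e.username)
                  .window(TumblingEventTimeWindows.of(Time.seconds(2)))
                  .sum("score")
                  .print();

            env.execute("game-score-pipeline");
        }
    }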
>> On Sun, Aug 4, 2019, 6:43 PM taher koitawala <taher...@gmail.com> wrote:
>>
>>> Fully agreed with Vino. I think let's chalk out the classes, make
>>> hierarchies, and start decoupling everything. Then we can move forward
>>> with the Flink and Beam streaming components.
>>>
>>> On Sun, Aug 4, 2019, 1:52 PM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Thank you for your more detailed thoughts. I fully agree with your
>>>> thoughts about HudiLink, which should also be part of the long-term
>>>> planning of the Hudi ecology.
>>>>
>>>> *But I found that our angles of thinking and starting points are not
>>>> consistent. I pay more attention to the rationality of the existing
>>>> architecture and whether the dependence on the computing engine is
>>>> pluggable. Don't get me wrong: I know very well that although we have
>>>> different perspectives, both views have value for Hudi.*
>>>>
>>>> Let me give more details on the discussion I raised earlier.
>>>>
>>>> Currently, multiple submodules of the Hudi project are tightly
>>>> coupled to Spark's design and dependencies. You can see that many of
>>>> the class files contain statements such as "import
>>>> org.apache.spark.xxx".
>>>>
>>>> I first put forward one discussion, "Integrate Hudi with Apache
>>>> Flink", and then came up with another: "Decouple Hudi and Spark".
>>>>
>>>> The word "integrate" I used in the first discussion may not be
>>>> accurate enough. My intention is to make the computing engine used by
>>>> Hudi pluggable. To Hudi, Spark is just a library; it is not the core
>>>> of Hudi and should not be strongly coupled with it. The features
>>>> currently provided by Spark are also available from Flink. But in
>>>> order to achieve this, we need to decouple Hudi's code from its use
>>>> of Spark.
>>>>
>>>> This makes sense both in terms of architectural rationality and
>>>> community ecology.
>>>>
>>>> Best,
>>>> Vino
>>>>
>>>> Semantic Beeng <n...@semanticbeeng.com> wrote on Sun, Aug 4, 2019 at
>>>> 2:21 PM:
>>>>
>>>>> "+1 for both Beam and Flink" - what I propose implies this indeed.
>>>>>
>>>>> But/and I am working from the desired functionality and a proposed
>>>>> design (as opposed to starting with refactoring Hudi with the goal
>>>>> of close integration with Flink).
>>>>>
>>>>> I feel this is not necessary - but I am not an expert in the Hudi
>>>>> implementation.
>>>>>
>>>>> But I am pretty sure it is not sufficient for the use cases I have
>>>>> in mind. The gist is using Hudi as a file-based data lake + ML
>>>>> feature store that enables incremental analyses done with a
>>>>> combination of Flink, Beam, Spark, and TensorFlow (see Petastorm
>>>>> from Uber Engineering for an idea).
>>>>>
>>>>> Let us call this HudiLink from now on (think of it as a mediator,
>>>>> not another Hudi).
>>>>>
>>>>> The intuition behind looking at more than Flink is that both Beam
>>>>> and Flink have good design abstractions we might reuse and extend.
>>>>>
>>>>> Like I said before, I do not believe in point-to-point integrations.
>>>>>
>>>>> Alternatively / in parallel, if you care to share your use cases,
>>>>> that would be very useful. Working with explicit use cases helps
>>>>> others relate and help.
>>>>>
>>>>> Also, if some of you believe in (see) value in refactoring the Hudi
>>>>> implementation for a hard integration with Flink (but have no time
>>>>> to argue for it), of course please go ahead.
>>>>>
>>>>> That may be a valid bottom-up approach, but I cannot relate to it
>>>>> myself (due to lack of use cases).
>>>>>
>>>>> I am working on material about HudiLink - if any of you are
>>>>> interested, I might publish it when it is more mature.
>>>>>
>>>>> Hint: this was part of the inspiration:
>>>>> https://eng.uber.com/michelangelo/
>>>>>
>>>>> One well-thought-out use case will get you "in". :-) Kidding, of
>>>>> course.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Nick
>>>>>
>>>>> On August 3, 2019 at 10:55 PM vino yang <yanghua1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> +1 for both Beam and Flink
>>>>>
>>>>> First step here is probably to draw out the current hierarchy and
>>>>> figure out what the abstraction points are.
>>>>> In my opinion, the runtime (Spark, Flink) should be done at the
>>>>> hoodie-client level and just used by hoodie-utilities seamlessly.
>>>>>
>>>>> +1 for Vinoth's opinion; it should be the first step.
>>>>>
>>>>> No matter which computing framework we hope Hudi to integrate with,
>>>>> we need to decouple the Hudi client and Spark.
>>>>>
>>>>> We may need a pure client module named, for example,
>>>>> hoodie-client-core (common).
>>>>>
>>>>> Then we could have: hoodie-client-spark, hoodie-client-flink and
>>>>> hoodie-client-beam.
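To make that layering concrete, here is a minimal sketch of the seam that a
hoodie-client-core module could expose. Every name below is hypothetical
(none of these types exist in Hudi), and the toy in-memory class only
stands in for what hoodie-client-spark or hoodie-client-flink would
implement on top of their native collections:

    import java.util.List;
    import java.util.UUID;

    /**
     * Hypothetical engine-agnostic API for a hoodie-client-core module.
     * I is the engine's native record collection, e.g.
     * JavaRDD<HoodieRecord> in hoodie-client-spark,
     * DataStream<HoodieRecord> in hoodie-client-flink,
     * PCollection<HoodieRecord> in hoodie-client-beam.
     */
    interface HoodieEngineClient<I> {
        String startCommit();
        I upsert(I records, String commitTime);
        boolean commit(String commitTime, I writeStatuses);
    }

    /** Toy in-memory implementation, standing in for an engine module. */
    class InMemoryEngineClient implements HoodieEngineClient<List<String>> {
        @Override
        public String startCommit() {
            return UUID.randomUUID().toString(); // real impl: new instant on the timeline
        }

        @Override
        public List<String> upsert(List<String> records, String commitTime) {
            return records; // real impl: index lookup, then insert/update file groups
        }

        @Override
        public boolean commit(String commitTime, List<String> writeStatuses) {
            return true; // real impl: finalize the instant after checking write statuses
        }
    }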
>>>>> Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at
>>>>> 10:45 AM:
>>>>>
>>>>> +1 for Beam -- agree with Semantic Beeng's analysis.
>>>>>
>>>>> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> So the way to go about this is to file a HIP, chalk all the classes
>>>>> out, and start moving towards a pure client.
>>>>>
>>>>> Secondly, do we want to try Beam?
>>>>>
>>>>> I think there is too much going on here and I'm not able to follow.
>>>>> If we want to try out Beam all along, I don't think it makes sense
>>>>> to do anything on Flink.
>>>>>
>>>>> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
>>>>> wrote:
>>>>>
>>>>> >> +1 My money is on this approach.
>>>>> >>
>>>>> >> The existing abstractions from Beam seem enough for the use cases
>>>>> >> as I imagine them.
>>>>> >>
>>>>> >> Flink also has "dynamic table", "table source" and "table sink",
>>>>> >> which seem very useful abstractions where Hudi might fit nicely.
>>>>> >>
>>>>> >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>>>>> >>
>>>>> >> Attached a screenshot.
>>>>> >>
>>>>> >> This seems to fit with the original premise of Hudi as well.
>>>>> >>
>>>>> >> I am exploring this avenue with a use case that involves
>>>>> >> "temporal joins on streams", which I need for feature extraction.
>>>>> >>
>>>>> >> If anyone is interested in this or has concrete enough needs and
>>>>> >> use cases, please let me know.
>>>>> >>
>>>>> >> It is best to go from an agreed-upon set of 2-3 use cases.
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> Nick
>>>>> >>
>>>>> >> > Also, we do have some Beam experts on the mailing list.. Can
>>>>> >> > you please weigh in on the viability of using Beam as the
>>>>> >> > intermediate abstraction here between Spark/Flink? Hudi uses
>>>>> >> > RDD APIs like groupBy, mapToPair, sortAndRepartition,
>>>>> >> > reduceByKey, countByKey, and also does custom partitioning a
>>>>> >> > lot.
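For reference on that question: most of the listed RDD operations have
direct Beam counterparts, sketched below on toy data (class and transform
labels are illustrative, not from Hudi). The harder fit is the
repartition-and-sort and custom partitioning, since Beam deliberately
leaves physical partitioning to the runner:

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamRddEquivalents {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Toy "user:score" records.
            PCollection<String> records =
                    p.apply(Create.of(Arrays.asList("a:1", "b:2", "a:3")));

            // mapToPair -> MapElements into KV
            PCollection<KV<String, Long>> pairs = records.apply(
                    MapElements.into(TypeDescriptors.kvs(
                                    TypeDescriptors.strings(), TypeDescriptors.longs()))
                            .via((String s) -> KV.of(s.split(":")[0],
                                    Long.parseLong(s.split(":")[1]))));

            // reduceByKey -> a Combine.perKey, e.g. Sum
            pairs.apply("SumPerKey", Sum.longsPerKey());

            // countByKey -> Count.perKey
            pairs.apply("CountPerKey", Count.perKey());

            // groupBy -> GroupByKey
            pairs.apply("GroupPerKey", GroupByKey.create());

            // No direct counterpart for sortAndRepartition / custom
            // partitioning: Beam hides physical partitioning from user code.
            p.run().waitUntilFinish();
        }
    }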