Sounds good. Let's do that first.

On Mon, Aug 5, 2019, 11:59 AM vino yang <yanghua1...@gmail.com> wrote:

Hi Taher,

IMO, let's wait for more comments; after all, this discussion took place
over the weekend. Then we can take in Vinoth's and the community's comments
and suggestions.

I personally think that design is more important. Once we have a clear
idea, it is not too late to create an issue.

I am sorting out the classes that depend on Spark; maybe we can then
discuss how to decouple them.

What do you think?

Best,
Vino

taher koitawala <taher...@gmail.com> wrote on Mon, Aug 5, 2019 at 2:17 PM:

If everyone agrees that we should decouple Hudi and Spark to enable
processing-engine abstraction, should I open a Jira ticket for that?

On Sun, Aug 4, 2019 at 6:59 PM taher koitawala <taher...@gmail.com> wrote:

If anyone wants to see a Flink streaming pipeline, here is a really small
and basic one:
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example

Consider users playing a game across multiple platforms, where we only get
the timestamp, username, and current score as the record. The pipeline has
a custom source function that produces this stream of records.

Every 2 seconds, based on the event time attached to each record, the
pipeline aggregates the score of the current window into the user's total
score.

A user's score keeps increasing as new windows fire and new outputs are
emitted. That's where Hudi fits in my vision: Hudi intelligently shows
only the latest records written.
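For a feel of its shape, here is a minimal, self-contained sketch of such a
pipeline. It is not the actual repository code: the custom source is
replaced by a few literal records, and it emits plain per-window sums
rather than folding each window into the user's running total.

    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class GameScoreSketch {

      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // The linked example uses a custom SourceFunction; a few literal
        // (timestamp, username, score) records keep this sketch runnable.
        DataStream<Tuple3<Long, String, Long>> scores = env
            .fromElements(
                Tuple3.of(1_000L, "alice", 10L),
                Tuple3.of(1_500L, "bob", 5L),
                Tuple3.of(2_500L, "alice", 7L))
            .assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<Tuple3<Long, String, Long>>() {
                  @Override
                  public long extractAscendingTimestamp(
                      Tuple3<Long, String, Long> record) {
                    return record.f0; // event time travels inside the record
                  }
                });

        // Key by username and sum the score field over 2-second event-time
        // windows; a Hudi sink would replace print(), so that upserts leave
        // only the latest total visible per user.
        scores.keyBy(1)
            .timeWindow(Time.seconds(2))
            .sum(2)
            .print();

        env.execute("game-score-sketch");
      }
    }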
On Sun, Aug 4, 2019, 6:43 PM taher koitawala <taher...@gmail.com> wrote:

Fully agreed with Vino. Let's chalk out the classes, make hierarchies, and
start decoupling everything. Then we can move forward with the Flink and
Beam streaming components.

On Sun, Aug 4, 2019, 1:52 PM vino yang <yanghua1...@gmail.com> wrote:

Hi Nick,

Thank you for your more detailed thoughts. I fully agree with your ideas
about HudiLink, which should also be part of the long-term planning of the
Hudi ecosystem.

But I find that we are thinking from different angles and starting points.
I pay more attention to the soundness of the existing architecture and to
whether the dependency on the computing engine is pluggable. Don't get me
wrong: although we have different perspectives, both views have value for
Hudi.

Let me give more details on the discussion I started earlier.

Currently, multiple submodules of the Hudi project are tightly coupled to
Spark's design and dependencies. You can see that many of the class files
contain statements such as "import org.apache.spark.xxx".

I first put forward one discussion, "Integrate Hudi with Apache Flink",
and then a second, "Decouple Hudi and Spark".

The word "Integrate" in the first discussion may not have been accurate
enough. My intention is to make the computing engine used by Hudi
pluggable. To Hudi, Spark is just a library; it is not the core of Hudi
and should not be strongly coupled with it. The features Spark currently
provides are also available from Flink. But to achieve this, we need to
decouple Hudi's code from its use of Spark.
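For illustration only, here is one shape such a pluggable abstraction
could take. None of these types exist in Hudi today; every name is
hypothetical.

    import java.io.Serializable;
    import java.util.List;

    /** Engine-neutral function, so client code never imports Spark's Function. */
    interface SerializableFunction<T, R> extends Serializable {
      R apply(T input);
    }

    /** Engine-neutral handle on a distributed collection of records. */
    interface HoodieData<T> {
      <R> HoodieData<R> map(SerializableFunction<T, R> fn);
      HoodieData<T> repartition(int parallelism);
      List<T> collectAsList();
    }

    /** What hoodie-client code would receive instead of a JavaSparkContext. */
    interface HoodieEngineContext {
      <T> HoodieData<T> parallelize(List<T> data);
    }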
This makes sense both in terms of architectural soundness and community
ecology.

Best,
Vino

Semantic Beeng <n...@semanticbeeng.com> wrote on Sun, Aug 4, 2019 at 2:21 PM:

"+1 for both Beam and Flink" - what I propose implies this indeed.

But I am working from the desired functionality and a proposed design (as
opposed to starting by refactoring Hudi with the goal of close integration
with Flink).

I feel that is not necessary, though I am not an expert in the Hudi
implementation. But I am pretty sure it is not sufficient for the use
cases I have in mind. The gist is using Hudi as a file-based data lake
plus ML feature store that enables incremental analyses done with a
combination of Flink, Beam, Spark, and TensorFlow (see Petastorm from Uber
Engineering for an idea).

Let us call this HudiLink from now on (think of it as a mediator, not
another Hudi).

The intuition behind looking at more than Flink is that both Beam and
Flink have good design abstractions we might reuse and extend.

As I said before, I do not believe in point-to-point integrations.

Alternatively, or in parallel: if you care to share your use cases, that
would be very useful. Working with explicit use cases helps others relate
and help.

Also, if some of you see value in refactoring the Hudi implementation for
a hard integration with Flink (but have no time to argue for it), of
course please go ahead. That may be a valid bottom-up approach, but I
cannot relate to it myself (due to a lack of use cases).

I am working on material about HudiLink; if anyone is interested, I may
publish it when it is more mature. Hint: this was part of the inspiration:
https://eng.uber.com/michelangelo/

One well-thought-out use case will get you "in". :-) Kidding, ofc.

Cheers,

Nick

On August 3, 2019 at 10:55 PM vino yang <yanghua1...@gmail.com> wrote:

+1 for both Beam and Flink

> The first step here is probably to draw out the current hierarchy and
> figure out what the abstraction points are. In my opinion, the runtime
> (Spark, Flink) should be done at the hoodie-client level and just used
> by hoodie-utilities seamlessly.

+1 for Vinoth's opinion; it should be the first step.

No matter which computing framework we want Hudi to integrate with, we
need to decouple the Hudi client from Spark.

We may need a pure client module named, for example, hoodie-client-core
(common).

Then we could have hoodie-client-spark, hoodie-client-flink, and
hoodie-client-beam.
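Building on the hypothetical interfaces sketched earlier in this thread,
hoodie-client-spark would then be the only module that imports Spark,
simply wrapping RDDs behind the engine-neutral types. Again, a sketch,
not existing Hudi code:

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    /** Hides a JavaRDD behind the engine-neutral HoodieData interface. */
    class SparkHoodieData<T> implements HoodieData<T> {
      private final JavaRDD<T> rdd;

      SparkHoodieData(JavaRDD<T> rdd) {
        this.rdd = rdd;
      }

      @Override
      public <R> HoodieData<R> map(SerializableFunction<T, R> fn) {
        // fn is Serializable, so Spark can ship this lambda to executors.
        return new SparkHoodieData<>(rdd.map(fn::apply));
      }

      @Override
      public HoodieData<T> repartition(int parallelism) {
        return new SparkHoodieData<>(rdd.repartition(parallelism));
      }

      @Override
      public List<T> collectAsList() {
        return rdd.collect();
      }
    }

    /** The only place a SparkContext appears; a Flink twin would wrap
        DataStream/DataSet the same way. */
    class SparkEngineContext implements HoodieEngineContext {
      private final JavaSparkContext jsc;

      SparkEngineContext(JavaSparkContext jsc) {
        this.jsc = jsc;
      }

      @Override
      public <T> HoodieData<T> parallelize(List<T> data) {
        return new SparkHoodieData<>(jsc.parallelize(data));
      }
    }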
Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:

+1 for Beam -- agree with Semantic Beeng's analysis.

On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:

So the way to go about this is to file a HIP, chalk out all the classes,
and start moving towards a pure client.

Secondly, should we want to try Beam? I think there is too much going on
here and I'm not able to follow. If we want to try out Beam all along, I
don't think it makes sense to do anything on Flink first.

On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:

+1. My money is on this approach.

The existing abstractions from Beam seem enough for the use cases as I
imagine them.

Flink also has "dynamic table", "table source", and "table sink", which
seem like very useful abstractions into which Hudi might fit nicely.

https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html

Attached is a screenshot. This seems to fit with the original premise of
Hudi as well.

I am exploring this avenue with a use case that involves "temporal joins
on streams", which I need for feature extraction. If anyone is interested
in this or has concrete enough needs and use cases, please let me know.
Best to go from an agreed-upon set of 2-3 use cases.

Cheers,

Nick

> Also, we do have some Beam experts on the mailing list. Can you please
> weigh in on the viability of using Beam as the intermediate abstraction
> here between Spark/Flink? Hudi uses RDD APIs like groupBy, mapToPair,
> sortAndRepartition, reduceByKey, and countByKey, and also does custom
> partitioning a lot.
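For reference, a rough sketch of how some of those RDD calls could line up
with Beam transforms on a PCollection<KV<String, Long>>. This is an
assumption about feasibility, not a claim that the mapping is complete:
sortAndRepartition and custom partitioning are exactly the hard part,
since Beam leaves physical partitioning to the runner.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class RddToBeamSketch {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // mapToPair has no dedicated transform; any ParDo/MapElements
        // that emits KV<K, V> plays the same role.
        PCollection<KV<String, Long>> pairs = p.apply(
            Create.of(KV.of("a", 1L), KV.of("a", 2L), KV.of("b", 3L))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of())));

        pairs.apply("groupBy", GroupByKey.create());   // ~ RDD groupByKey
        pairs.apply("reduceByKey", Sum.longsPerKey()); // ~ reduceByKey(_ + _)
        pairs.apply("countByKey", Count.perKey());     // ~ countByKey

        // No one-to-one translation exists for sortAndRepartition or
        // custom partitioners; those would need rethinking in Beam terms.

        p.run().waitUntilFinish();
      }
    }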