Hey guys, any thoughts on the above idea? That is, handling HoodieBloomIndex with
HeapState, RocksDBState and FsState, but on Spark.
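
To make the idea a bit more concrete, here is a minimal sketch of what such a
pluggable index state could look like. All names here (HoodieIndexState,
HeapIndexState, IndexEntry) are hypothetical and do not exist in Hudi today;
this is only meant to illustrate the HeapState/RocksDBState/FsState idea with a
TTL hook, independent of Spark caching:

import java.io.Serializable;
import java.time.Duration;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not existing Hudi code): a pluggable state backend for the
 * record-key -> partition/file mapping that HoodieBloomIndex currently keeps in
 * Spark's cache. Loosely modeled on Flink's state backends.
 */
interface HoodieIndexState extends Serializable {

  /** Location of a record key: partition path + file id. */
  final class IndexEntry implements Serializable {
    final String partitionPath;
    final String fileId;
    IndexEntry(String partitionPath, String fileId) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
    }
  }

  /** Look up the location of a record key, if it has been indexed. */
  Optional<IndexEntry> get(String recordKey);

  /** Insert or update the location of a record key. */
  void put(String recordKey, IndexEntry entry);

  /** Drop entries older than the given TTL (the "index Time To Live"). */
  void expire(Duration ttl);
}

/** Keeps the index on the JVM heap -- the "HeapState" flavor. */
class HeapIndexState implements HoodieIndexState {

  private static final class Timestamped implements Serializable {
    final IndexEntry entry;
    final long writtenAtMillis = System.currentTimeMillis();
    Timestamped(IndexEntry entry) { this.entry = entry; }
  }

  private final Map<String, Timestamped> entries = new ConcurrentHashMap<>();

  @Override
  public Optional<IndexEntry> get(String recordKey) {
    Timestamped t = entries.get(recordKey);
    return t == null ? Optional.empty() : Optional.of(t.entry);
  }

  @Override
  public void put(String recordKey, IndexEntry entry) {
    entries.put(recordKey, new Timestamped(entry));
  }

  @Override
  public void expire(Duration ttl) {
    long cutoff = System.currentTimeMillis() - ttl.toMillis();
    entries.values().removeIf(t -> t.writtenAtMillis < cutoff);
  }
}

// A RocksDBIndexState (embedded RocksDB) and an FsIndexState (HDFS/S3/Azure FS)
// would implement the same interface, so the engine (Spark or Flink) only decides
// where the index lives, not how it is consulted.

The point of the interface is that the index tagging/lookup step would call
get/put against whichever backend is configured, instead of relying on RDD#persist.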

On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <taher...@gmail.com> wrote:

> Hi Vinoth,
> Having seen the doc and code, I understand that HoodieBloomIndex mainly
> caches the record key and partition path. Can we handle this the way Flink
> does? For example: a HeapState where the user chooses to cache the index on
> the heap, a RocksDBState where indexes are written to RocksDB, and finally an
> FsState where indexes can be written to HDFS, S3, or Azure FS. On top of
> that, we can add an index Time To Live (TTL).
>
> Regards,
> Taher Koitawala
>
> On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <vin...@apache.org> wrote:
>
>> I still feel the key thing here is reimplementing HoodieBloomIndex without
>> needing Spark caching.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
>>  documents the spark DAG in detail.
>>
>> If everyone feels it's best for me to scope the work out, then I'm happy to
>> do it!
>>
>> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com>
>> wrote:
>>
>> > Guys, I think we are slowing down on this again. We need to start planning
>> > small tasks towards this. VC, can you please help fast-track this?
>> >
>> > Regards,
>> > Taher Koitawala
>> >
>> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org>
>> wrote:
>> >
>> > > Look forward to the analysis. A key class to read would be
>> > > HoodieBloomIndex, which uses a lot of Spark caching and shuffles.
>> > >
>> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com>
>> wrote:
>> > >
>> > > > >> Currently Spark Streaming micro-batching fits well with Hudi, since it
>> > > > >> amortizes the cost of indexing, workload profiling etc.: 1 Spark
>> > > > >> micro-batch = 1 Hudi commit.
>> > > > >> With the per-record model in Flink, I am not sure how useful it will be
>> > > > >> to support Hudi.. for e.g., 1 input record cannot be 1 Hudi commit; it
>> > > > >> will be inefficient..
>> > > >
>> > > > Yes, if 1 input record = 1 Hudi commit, it would be inefficient. For
>> > > > Flink streaming, we can also implement a "batch" or "micro-batch" model
>> > > > when processing data. For example:
>> > > >
>> > > >    - aggregation: use Flink's flexible window mechanism;
>> > > >    - non-aggregation: use Flink's stateful state API to cache a batch of
>> > > >    data
>> > > >
>> > > >
>> > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a full
>> > > > >> summary of how Spark is being used in a wiki page is a good start IMO.
>> > > > >> We can then hash out what can be generalized and what cannot be and
>> > > > >> needs to be left in hudi-client-spark vs hudi-client-core
>> > > >
>> > > > agree
>> > > >
>> > > > Vinoth Chandar <vin...@apache.org> wrote on Wed, Aug 14, 2019 at 8:35 AM:
>> > > >
>> > > > > >> We should only stick to Flink Streaming. Furthermore if there
>> is a
>> > > > > requirement for batch then users
>> > > > > >> should use Spark or then we will anyway have a beam integration
>> > > coming
>> > > > > up.
>> > > > >
>> > > > > Currently Spark Streaming micro-batching fits well with Hudi, since it
>> > > > > amortizes the cost of indexing, workload profiling etc.: 1 Spark
>> > > > > micro-batch = 1 Hudi commit.
>> > > > > With the per-record model in Flink, I am not sure how useful it will be
>> > > > > to support Hudi.. for e.g., 1 input record cannot be 1 Hudi commit; it
>> > > > > will be inefficient..
>> > > > >
>> > > > > On first focussing on decoupling of Spark and Hudi alone, yes a full
>> > > > > summary of how Spark is being used in a wiki page is a good start IMO.
>> > > > > We can then hash out what can be generalized and what cannot be and
>> > > > > needs to be left in hudi-client-spark vs hudi-client-core
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <yanghua1...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Hi Nick and Taher,
>> > > > > >
>> > > > > > I just want to answer Nishith's question. Reference his old
>> > > description
>> > > > > > here:
>> > > > > >
>> > > > > > > You can do a parallel investigation while we are deciding on the
>> > > > > > > module structure. You could be looking at all the patterns in Hudi's
>> > > > > > > Spark API usage (RDD/DataSource/SparkContext) and see if such support
>> > > > > > > can be achieved in theory with Flink. If not, what is the workaround?
>> > > > > > > Documenting such patterns would be valuable when multiple engineers
>> > > > > > > are working on it. For e.g., Hudi relies on (a) custom partitioning
>> > > > > > > logic for upserts, (b) caching RDDs to avoid reruns of costly stages,
>> > > > > > > (c) a Spark upsert task knowing its Spark partition/task/attempt ids
>> > > > > >
>> > > > > > And just like the title of this thread, we are going to try to
>> > > > > > decouple Hudi and Spark. That means we can run the whole of Hudi
>> > > > > > without depending on Spark. So we need to analyze all the usage of
>> > > > > > Spark in Hudi.
>> > > > > >
>> > > > > > Here we are not discussing the integration of Hudi and Flink in
>> the
>> > > > > > application layer. Instead, I want Hudi to be decoupled from
>> Spark
>> > > and
>> > > > > > allow other engines (such as Flink) to replace Spark.
>> > > > > >
>> > > > > > It can be divided into long-term goals and short-term goals. As
>> > > Nishith
>> > > > > > stated in a recent email.
>> > > > > >
>> > > > > > I mentioned the Flink Batch API here because Hudi can connect with
>> > > > > > many different sources/sinks. Some file-based reads are not appropriate
>> > > > > > for Flink Streaming.
>> > > > > >
>> > > > > > Therefore, this is a comprehensive survey of the use of Spark in
>> > > Hudi.
>> > > > > >
>> > > > > > Best,
>> > > > > > Vino
>> > > > > >
>> > > > > >
>> > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 13, 2019 at 5:43 PM:
>> > > > > >
>> > > > > > > Hi Vino,
>> > > > > > >       According to what I've seen, Hudi has a lot of Spark
>> > > > > > > components flowing through it, like TaskContexts, JavaSparkContexts,
>> > > > > > > etc. The main classes I guess we should focus on are HoodieTable and
>> > > > > > > the Hoodie write clients.
>> > > > > > >
>> > > > > > > Also Vino, I don't think we should be providing a Flink DataSet
>> > > > > > > implementation. We should only stick to Flink Streaming.
>> > > > > > > Furthermore, if there is a requirement for batch, then users should
>> > > > > > > use Spark, or else we will anyway have a Beam integration coming up.
>> > > > > > >
>> > > > > > > As for the cache, how about we write our own stateful Flink
>> > > > > > > function and use RocksDBStateBackend with some state TTL?
>> > > > > > >
>> > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <
>> yanghua1...@gmail.com>
>> > > > wrote:
>> > > > > > >
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > After doing some research, let me share my information:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >    - Limitation of computing engine capabilities: Hudi uses Spark's
>> > > > > > > >    RDD#persist, and Flink currently has no API to cache datasets.
>> > > > > > > >    Maybe we can only choose to use external storage, or not use a
>> > > > > > > >    cache at all? For the use of other APIs, the two currently offer
>> > > > > > > >    almost equivalent capabilities.
>> > > > > > > >    - The abstraction of the computing engine is different:
>> > > > > > > >    Considering the different usage scenarios of the computing engine
>> > > > > > > >    in Hudi, Flink has not yet implemented stream/batch unification,
>> > > > > > > >    so we may use both Flink's DataSet API (batch processing) and
>> > > > > > > >    DataStream API (stream processing).
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > Vino
>> > > > > > > >
>> > > > > > > > nishith agarwal <n3.nas...@gmail.com> wrote on Thu, Aug 8, 2019 at 12:57 AM:
>> > > > > > > >
>> > > > > > > > > Nick,
>> > > > > > > > >
>> > > > > > > > > You bring up a good point about the non-trivial
>> programming
>> > > model
>> > > > > > > > > differences between these different technologies. From a
>> > > > > theoretical
>> > > > > > > > > perspective, I'd say considering a higher level
>> abstraction
>> > > makes
>> > > > > > > sense.
>> > > > > > > > I
>> > > > > > > > > think we have to decouple some objectives and concerns
>> here.
>> > > > > > > > >
>> > > > > > > > > a) The immediate desire is to have Hudi be able to run on
>> a
>> > > Flink
>> > > > > (or
>> > > > > > > > > non-spark) engine. This naturally begs the question of
>> > > decoupling
>> > > > > > Hudi
>> > > > > > > > > concepts from direct Spark dependencies.
>> > > > > > > > >
>> > > > > > > > > b) If we do want to initiate the above effort, would it
>> make
>> > > > sense
>> > > > > to
>> > > > > > > > just
>> > > > > > > > > have a higher level abstraction, building on other
>> > technologies
>> > > > > like
>> > > > > > > beam
>> > > > > > > > > (euphoria etc) and provide single, clean API's that may be
>> > more
>> > > > > > > > > maintainable from a code perspective. But at the same time
>> > this
>> > > > > will
>> > > > > > > > > introduce challenges on how to maintain efficiency and
>> > > optimized
>> > > > > > > runtime
>> > > > > > > > > dags for Hudi (since the code would move away from point
>> > > > > integrations
>> > > > > > > and
>> > > > > > > > > whenever this happens, tuning natively for specific
>> engines
>> > > > becomes
>> > > > > > > more
>> > > > > > > > > and more difficult).
>> > > > > > > > >
>> > > > > > > > > My general opinion is that, as the community grows over
>> time
>> > > with
>> > > > > > more
>> > > > > > > > > folks having an in-depth understanding of Hudi, going from
>> > > > > > > current_state
>> > > > > > > > ->
>> > > > > > > > > (a) -> (b) might be the most reliable and adoptable path
>> for
>> > > this
>> > > > > > > > project.
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Nishith
>> > > > > > > > >
>> > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <
>> > > > > > n...@semanticbeeng.com>
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > There are some non-trivial differences in programming model and
>> > > > > > > > > > runtime semantics between Beam, Spark and Flink.
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
>> > > > > > > > > >
>> > > > > > > > > > Nishith, Vino - thoughts?
>> > > > > > > > > >
>> > > > > > > > > > Does it make sense to consider a higher-level abstraction / DSL
>> > > > > > > > > > instead of maintaining different code with the same functionality
>> > > > > > > > > > but different programming models?
>> > > > > > > > > >
>> > > > > > > > > >
>> https://beam.apache.org/documentation/sdks/java/euphoria/
>> > > > > > > > > >
>> > > > > > > > > > Nick
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal <
>> > > > > n3.nas...@gmail.com>
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > +1 for Approach 1, point integration with each framework.
>> > > > > > > > > >
>> > > > > > > > > > Pros for point integration:
>> > > > > > > > > >
>> > > > > > > > > >    - The Hudi community is already familiar with Spark and
>> > > > > > > > > >    Spark-based actions/shuffles etc. Since both modules can be
>> > > > > > > > > >    decoupled, this enables us to have a steady release of Hudi for
>> > > > > > > > > >    one execution engine (Spark) while we hone our skills and
>> > > > > > > > > >    iterate on making the Flink DAG optimized and performant with
>> > > > > > > > > >    the right configuration.
>> > > > > > > > > >    - This might be a stepping stone towards rewriting the entire
>> > > > > > > > > >    code base to be agnostic of Spark/Flink. This approach will
>> > > > > > > > > >    help us fix tests and intricacies, and help make the code base
>> > > > > > > > > >    ready for a larger rework.
>> > > > > > > > > >    - Seems like the easiest way to add Flink support.
>> > > > > > > > > >
>> > > > > > > > > > Cons:
>> > > > > > > > > >
>> > > > > > > > > >    - More code paths to maintain and reason about, since the Spark
>> > > > > > > > > >    and Flink integrations will naturally diverge over time.
>> > > > > > > > > >
>> > > > > > > > > > Theoretically, I do like the idea of being able to run the Hudi
>> > > > > > > > > > DAG on Beam more than point integrations, where there is one
>> > > > > > > > > > API/logic to reason about. But practically, that may not be the
>> > > > > > > > > > right direction.
>> > > > > > > > > >
>> > > > > > > > > > Pros:
>> > > > > > > > > >
>> > > > > > > > > >    - Lesser cognitive burden in maintaining, evolving and releasing
>> > > > > > > > > >    the project with one API to reason with.
>> > > > > > > > > >    - Theoretically, going forward, assuming Beam is adopted as a
>> > > > > > > > > >    standard programming paradigm for stream/batch, this would
>> > > > > > > > > >    enable consumers to leverage the power of Hudi more easily.
>> > > > > > > > > >
>> > > > > > > > > > Cons:
>> > > > > > > > > >
>> > > > > > > > > >    - Massive rewrite of the code base. Additionally, since we would
>> > > > > > > > > >    have moved away from directly using Spark APIs, there is a
>> > > > > > > > > >    bigger risk of regression. We would have to be very thorough
>> > > > > > > > > >    with all the intricacies and ensure the same stability of new
>> > > > > > > > > >    releases.
>> > > > > > > > > >    - Managing future features (which may be very Spark driven) will
>> > > > > > > > > >    either clash, or pause, or will need to be reworked.
>> > > > > > > > > >    - Tuning jobs for Spark/Flink-type execution frameworks
>> > > > > > > > > >    individually might be difficult, and will get more difficult
>> > > > > > > > > >    over time as the project evolves, where some Beam integrations
>> > > > > > > > > >    with Spark/Flink may not work as expected.
>> > > > > > > > > >    - Also, as pointed out above, we would probably still need to
>> > > > > > > > > >    support the hoodie-spark module as a first-class citizen.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > Nishith
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <
>> > > > > taher...@gmail.com
>> > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > Hi Vinoth,
>> > > > > > > > > > Are there some tasks I can take up to ramp up the code?
>> > Want
>> > > to
>> > > > > get
>> > > > > > > > > > more used to the code and understand the existing
>> > > > implementation
>> > > > > > > > better.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > Taher Koitawala
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <
>> > > > vin...@apache.org>
>> > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > Let's see if others have any thoughts as well. We can
>> plan
>> > to
>> > > > fix
>> > > > > > the
>> > > > > > > > > > approach by EOW.
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <
>> > > > yanghua1...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > Hi guys,
>> > > > > > > > > >
>> > > > > > > > > > Also, +1 for Approach 1 like Taher.
>> > > > > > > > > >
>> > > > > > > > > > If we can do a comprehensive analysis of this model and come up
>> > > > > > > > > > with means to refactor this cleanly, this would be promising.
>> > > > > > > > > >
>> > > > > > > > > > Yes, when we get the conclusion, we could start this
>> work.
>> > > > > > > > > >
>> > > > > > > > > > Best,
>> > > > > > > > > > Vino
>> > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 6, 2019 at 12:28 AM:
>> > > > > > > > > >
>> > > > > > > > > > +1 for Approach 1, point integration with each framework.
>> > > > > > > > > >
>> > > > > > > > > > Approach 2 has a problem, as you said: "Developers need to think
>> > > > > > > > > > about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the
>> > > > > > > > > > end, this may not be the panacea that it seems to be."
>> > > > > > > > > >
>> > > > > > > > > > We have seen various pipelines in the Beam DAG being expressed
>> > > > > > > > > > differently than we had them in our original use case. Also,
>> > > > > > > > > > switching between the Spark and Flink runners in Beam has various
>> > > > > > > > > > impacts on the pipelines, e.g. some features available in Flink
>> > > > > > > > > > are not available on the Spark runner.
>> > > > > > > > > >
>> > > > > > > > > > Refer to this capability matrix ->
>> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
>> > > > > > > > > >
>> > > > > > > > > > Hence my vote is on Approach 1: let's decouple and build the
>> > > > > > > > > > abstraction for each framework. That is a much better option. We
>> > > > > > > > > > will also have more control over each framework's implementation.
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <
>> > > vin...@apache.org
>> > > > >
>> > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > Would like to highlight that there are two distinct approaches
>> > > > > > > > > > here with different tradeoffs. Think of this as my braindump, as I
>> > > > > > > > > > have been thinking about this quite a bit in the past.
>> > > > > > > > > >
>> > > > > > > > > > *Approach 1: Point integration with each framework*
>> > > > > > > > > >
>> > > > > > > > > > >> We may need a pure client module named, for example,
>> > > > > > > > > > >> hoodie-client-core (common). Then we could have:
>> > > > > > > > > > >> hoodie-client-spark, hoodie-client-flink and hoodie-client-beam
>> > > > > > > > > >
>> > > > > > > > > > (+) This is the safest to do IMO, since we can isolate the current
>> > > > > > > > > > Spark execution (hoodie-spark, hoodie-client-spark) from the
>> > > > > > > > > > changes for Flink, while it stabilizes over a few releases.
>> > > > > > > > > > (-) Downside is that the utilities need to be redone:
>> > > > > > > > > > hoodie-utilities-spark and hoodie-utilities-flink and
>> > > > > > > > > > hoodie-utilities-core? hoodie-cli?
>> > > > > > > > > >
>> > > > > > > > > > If we can do a comprehensive analysis of this model and come up
>> > > > > > > > > > with means to refactor this cleanly, this would be promising.
>> > > > > > > > > >
>> > > > > > > > > > *Approach 2: Beam as the compute abstraction*
>> > > > > > > > > >
>> > > > > > > > > > Another more drastic approach is to remove Spark as the compute
>> > > > > > > > > > abstraction for writing data and replace it with Beam.
>> > > > > > > > > >
>> > > > > > > > > > (+) All of the code remains more or less similar, and there is one
>> > > > > > > > > > compute API to reason about.
>> > > > > > > > > > (-) The (very big) assumption here is that we are able to tune the
>> > > > > > > > > > Spark runtime the same way using Beam: custom partitioners, support
>> > > > > > > > > > for all RDD operations we invoke, caching etc.
>> > > > > > > > > > (-) It will be a massive rewrite, and testing of such a large
>> > > > > > > > > > rewrite would also be really challenging, since we need to pay
>> > > > > > > > > > attention to all intricate details to ensure the Spark users today
>> > > > > > > > > > experience no regressions/side-effects.
>> > > > > > > > > > (-) Note that we still need to probably support the hoodie-spark
>> > > > > > > > > > module, and maybe a first-class such integration with Flink, for
>> > > > > > > > > > native Flink/Spark pipeline authoring. Users of, say, DeltaStreamer
>> > > > > > > > > > need to pass in Spark or Flink configs anyway.. Developers need to
>> > > > > > > > > > think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So
>> > > > > > > > > > in the end, this may not be the panacea that it seems to be.
>> > > > > > > > > >
>> > > > > > > > > > One goal for the HIP is to get us all to agree as a community which
>> > > > > > > > > > one to pick, with sufficient investigation, testing, benchmarking..
>> > > > > > > > > >
>> > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <
>> > > > yanghua1...@gmail.com>
>> > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > +1 for both Beam and Flink
>> > > > > > > > > >
>> > > > > > > > > > First step here is to probably draw out current
>> hierrarchy
>> > > and
>> > > > > > > > > >
>> > > > > > > > > > figure
>> > > > > > > > > >
>> > > > > > > > > > out
>> > > > > > > > > >
>> > > > > > > > > > what the abstraction points are..
>> > > > > > > > > > In my opinion, the runtime (spark, flink) should be
>> done at
>> > > the
>> > > > > > > > > > hoodie-client level and just used by hoodie-utilties
>> > > > > > > > > >
>> > > > > > > > > > seamlessly..
>> > > > > > > > > >
>> > > > > > > > > > +1 for Vinoth's opinion, it should be the first step.
>> > > > > > > > > >
>> > > > > > > > > > No matter which computing framework we hope to integrate Hudi
>> > > > > > > > > > with, we need to decouple the Hudi client from Spark.
>> > > > > > > > > >
>> > > > > > > > > > We may need a pure client module named for example
>> > > > > > > > > > hoodie-client-core(common)
>> > > > > > > > > >
>> > > > > > > > > > Then we could have: hoodie-client-spark,
>> > hoodie-client-flink
>> > > > and
>> > > > > > > > > > hoodie-client-beam
>> > > > > > > > > >
>> > > > > > > > > > Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
>> > > > > > > > > >
>> > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
>> > > > > > > > > >
>> > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <
>> > > > > > > > > >
>> > > > > > > > > > taher...@gmail.com>
>> > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > So the way to go around this is to file a HIP, chalk out all the
>> > > > > > > > > > classes, and start moving towards a pure client.
>> > > > > > > > > >
>> > > > > > > > > > Secondly, should we want to try Beam?
>> > > > > > > > > >
>> > > > > > > > > > I think there is too much going on here and I'm not able to
>> > > > > > > > > > follow. If we want to try out Beam all along, I don't think it
>> > > > > > > > > > makes sense to do anything on Flink then.
>> > > > > > > > > >
>> > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <
>> > > > > > > > > >
>> > > > > > > > > > n...@semanticbeeng.com>
>> > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > >> +1 My money is on this approach.
>> > > > > > > > > > >>
>> > > > > > > > > > >> The existing abstractions from Beam seem enough for
>> the
>> > > use
>> > > > > > > > > >
>> > > > > > > > > > cases
>> > > > > > > > > >
>> > > > > > > > > > as I
>> > > > > > > > > >
>> > > > > > > > > > imagine them.
>> > > > > > > > > >
>> > > > > > > > > > >> Flink also has "dynamic table", "table source" and
>> > "table
>> > > > > > > > > >
>> > > > > > > > > > sink"
>> > > > > > > > > >
>> > > > > > > > > > which
>> > > > > > > > > >
>> > > > > > > > > > seem very useful abstractions where Hudi might fit
>> nicely.
>> > > > > > > > > >
>> > > > > > > > > > >>
>> > > > > > > > > > >>
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>> > > > > > > > > >
>> > > > > > > > > > >>
>> > > > > > > > > > >> Attached a screen shot.
>> > > > > > > > > > >>
>> > > > > > > > > > >> This seems to fit with the original premise of Hudi
>> as
>> > > well.
>> > > > > > > > > > >>
>> > > > > > > > > > >> Am exploring this venue with a use case that involves
>> > > > > > > > > >
>> > > > > > > > > > "temporal
>> > > > > > > > > >
>> > > > > > > > > > joins
>> > > > > > > > > >
>> > > > > > > > > > on
>> > > > > > > > > >
>> > > > > > > > > > streams" which I need for feature extraction.
>> > > > > > > > > >
>> > > > > > > > > > >> Anyone is interested in this or has concrete enough
>> > needs
>> > > > > > > > > >
>> > > > > > > > > > and
>> > > > > > > > > >
>> > > > > > > > > > use
>> > > > > > > > > >
>> > > > > > > > > > cases
>> > > > > > > > > >
>> > > > > > > > > > please let me know.
>> > > > > > > > > >
>> > > > > > > > > > >> Best to go from an agreed upon set of 2-3 use cases.
>> > > > > > > > > > >>
>> > > > > > > > > > >> Cheers
>> > > > > > > > > > >>
>> > > > > > > > > > >> Nick
>> > > > > > > > > > >>
>> > > > > > > > > > >>
>> > > > > > > > > > >> > Also, we do have some Beam experts on the mailing
>> > list..
>> > > > > > > > > >
>> > > > > > > > > > Can
>> > > > > > > > > >
>> > > > > > > > > > you
>> > > > > > > > > >
>> > > > > > > > > > please
>> > > > > > > > > > >> weigh on viability of using Beam as the intermediate
>> > > > > > > > > >
>> > > > > > > > > > abstraction
>> > > > > > > > > >
>> > > > > > > > > > here
>> > > > > > > > > >
>> > > > > > > > > > between Spark/Flink?
>> > > > > > > > > > Hudi uses RDD apis like groupBy, mapToPair,
>> > > > > > > > > >
>> > > > > > > > > > sortAndRepartition,
>> > > > > > > > > >
>> > > > > > > > > > reduceByKey, countByKey and also does custom
>> partitioning a
>> > > > > > > > > >
>> > > > > > > > > > lot.>
>> > > > > > > > > >
>> > > > > > > > > > >> >
>> > > > > > > > > > >>
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
