+1. This is a pretty large undertaking. While the community is getting its
hands dirty and ramping up on Hudi internals, it would be productive if Vinoth
shepherds this.

Balaji.V

On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar
<vin...@apache.org> wrote:
sg. :)

I will wait for others on this thread as well to chime in.

On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala <taher...@gmail.com> wrote:

> Vinoth, I think right now, given your experience with the project, you
> should be scoping out what needs to be done to take us there. So +1 for
> giving you more work :)
>
> We want to reach a point where we can start scoping out the addition of
> Flink and Beam components within it. Then I think we will make tremendous
> progress.
>
> On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > I still feel the key thing here is reimplementing HoodieBloomIndex
> > without needing spark caching.
> >
> >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
> > documents the spark DAG in detail.
> >
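
A minimal sketch of the caching pattern in question, assuming a simplified
stand-in for the tagging step; the class and variable names are illustrative,
not Hudi's actual API. Without persist(), each Spark action re-runs the
upstream stages, which is what makes this hard to port to engines without a
cache:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class CachingSketch {
      public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "caching-sketch");
        // Stand-in for the expensive tagging step (bloom filter checks, file reads).
        JavaRDD<String> tagged = jsc
            .parallelize(Arrays.asList("k1:update", "k2:insert"))
            .map(s -> s); // the costly index lookup would happen here
        // Without this persist(), both actions below recompute the tagging stage.
        tagged.persist(StorageLevel.MEMORY_AND_DISK_SER());
        long updates = tagged.filter(s -> s.endsWith("update")).count();
        long inserts = tagged.filter(s -> s.endsWith("insert")).count();
        System.out.println(updates + " updates, " + inserts + " inserts");
        jsc.stop();
      }
    }
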
> > If everyone feels it's best for me to scope the work out, then happy to
> > do it!
> >
> > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com>
> > wrote:
> >
> > > Guys, I think we are slowing down on this again. We need to start
> > > planning small tasks towards this. VC, please can you help fast-track
> > > this?
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org>
> > > wrote:
> > >
> > > > Look forward to the analysis. A key class to read would be
> > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > >
> > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com>
> > > > wrote:
> > > >
> > > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> > > > > since it amortizes the cost of indexing, workload profiling, etc.
> > > > > 1 spark micro batch = 1 hudi commit. With the per-record model in
> > > > > Flink, I am not sure how useful it will be to support hudi.. for
> > > > > e.g., 1 input record cannot be 1 hudi commit, it will be
> > > > > inefficient..
> > > > >
> > > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> > > > > For Flink streaming, we can also implement a "batch" or
> > > > > "micro-batch" model when processing data. For example (sketched
> > > > > below):
> > > > >
> > > > >    - aggregation: use Flink's flexible window mechanism;
> > > > >    - non-aggregation: use Flink's stateful API to cache a batch of
> > > > >    data
> > > > >
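
A minimal sketch of the window idea for the aggregation case, using the Flink
1.x DataStream API; the keys and the window size are illustrative:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WindowedBatchSketch {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // Group incoming records into 1-minute windows, so that one window
        // (not one record) would map to one hudi commit downstream.
        env.fromElements(Tuple2.of("key1", 1), Tuple2.of("key2", 1))
            .keyBy(0)
            .timeWindow(Time.minutes(1))
            .sum(1)
            .print();
        env.execute("windowed-batch-sketch");
      }
    }
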
> > > > >
> > > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> > > > > full summary of how Spark is being used in a wiki page is a good
> > > > > start IMO. We can then hash out what can be generalized and what
> > > > > cannot be and needs to be left in hudi-client-spark vs
> > > > > hudi-client-core
> > > > >
> > > > > agree
> > > > >
> > > > > Vinoth Chandar <vin...@apache.org> wrote on Wed, Aug 14, 2019 at
> > > > > 8:35 AM:
> > > > >
> > > > > > >> We should only stick to Flink Streaming. Furthermore, if
> > > > > > there is a requirement for batch, then users should use Spark,
> > > > > > or then we will anyway have a beam integration coming up.
> > > > > >
> > > > > > Currently Spark Streaming micro batching fits well with Hudi,
> > > > > > since it amortizes the cost of indexing, workload profiling,
> > > > > > etc. 1 spark micro batch = 1 hudi commit.
> > > > > > With the per-record model in Flink, I am not sure how useful it
> > > > > > will be to support hudi.. for e.g., 1 input record cannot be 1
> > > > > > hudi commit, it will be inefficient..
> > > > > >
> > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> > > > > > full summary of how Spark is being used in a wiki page is a good
> > > > > > start IMO. We can then hash out what can be generalized and what
> > > > > > cannot be and needs to be left in hudi-client-spark vs
> > > > > > hudi-client-core
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <yanghua1...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Nick and Taher,
> > > > > > >
> > > > > > > I just want to answer Nishith's question, referencing his old
> > > > > > > description here:
> > > > > > >
> > > > > > > > You can do a parallel investigation while we are deciding
> > > > > > > > on the module structure. You could be looking at all the
> > > > > > > > patterns in Hudi's Spark API usage
> > > > > > > > (RDD/DataSource/SparkContext) and see if such support can be
> > > > > > > > achieved in theory with Flink. If not, what is the
> > > > > > > > workaround? Documenting such patterns would be valuable when
> > > > > > > > multiple engineers are working on it. For e.g., Hudi relies
> > > > > > > > on (a) custom partitioning logic for upserts, (b) caching
> > > > > > > > RDDs to avoid reruns of costly stages, and (c) a Spark
> > > > > > > > upsert task knowing its spark partition/task/attempt ids.
> > > > > > >
> > > > > > > And just like the title of this thread, we are going to try to
> > > > > > > decouple Hudi and Spark. That means we can run the whole of
> > > > > > > Hudi without depending on Spark. So we need to analyze all the
> > > > > > > usage of Spark in Hudi.
> > > > > > >
> > > > > > > Here we are not discussing the integration of Hudi and Flink
> > > > > > > in the application layer. Instead, I want Hudi to be decoupled
> > > > > > > from Spark and allow other engines (such as Flink) to replace
> > > > > > > Spark.
> > > > > > >
> > > > > > > It can be divided into long-term goals and short-term goals,
> > > > > > > as Nishith stated in a recent email.
> > > > > > >
> > > > > > > I mentioned the Flink Batch API here because Hudi can connect
> > > > > > > with many different sources/sinks. Some file-based reads are
> > > > > > > not appropriate for Flink Streaming.
> > > > > > >
> > > > > > > Therefore, this is a comprehensive survey of the use of Spark
> > > > > > > in Hudi.
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > >
> > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 13,
> > > > > > > 2019 at 5:43 PM:
> > > > > > >
> > > > > > > > Hi Vino,
> > > > > > > >      According to what I've seen, Hudi has a lot of Spark
> > > > > > > > components flowing through it, like TaskContexts,
> > > > > > > > JavaSparkContexts, etc. The main classes I guess we should
> > > > > > > > focus upon are HoodieTable and the Hoodie write clients.
> > > > > > > >
> > > > > > > > Also Vino, I don't think we should be providing a Flink
> > > > > > > > dataset implementation. We should only stick to Flink
> > > > > > > > Streaming. Furthermore, if there is a requirement for batch,
> > > > > > > > then users should use Spark, or then we will anyway have a
> > > > > > > > beam integration coming up.
> > > > > > > >
> > > > > > > > As for the cache, how about we write our own stateful Flink
> > > > > > > > function and use RocksDBStateBackend with some state TTL
> > > > > > > > (see the sketch below)?
> > > > > > > >
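
A minimal sketch of that suggestion, assuming Flink 1.x; the flush threshold
and TTL are illustrative, and the surrounding job would also set
env.setStateBackend(new RocksDBStateBackend(checkpointPath)) via the
flink-statebackend-rocksdb dependency:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Buffers records per key and emits a "micro batch" once enough
    // accumulate; the TTL evicts stale buffers automatically.
    public class BufferingFunction
        extends KeyedProcessFunction<String, String, List<String>> {
      private transient ListState<String> buffer;

      @Override
      public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .build();
        ListStateDescriptor<String> desc =
            new ListStateDescriptor<>("buffer", String.class);
        desc.enableTimeToLive(ttl);
        buffer = getRuntimeContext().getListState(desc);
      }

      @Override
      public void processElement(String record, Context ctx,
          Collector<List<String>> out) throws Exception {
        buffer.add(record);
        List<String> batch = new ArrayList<>();
        for (String r : buffer.get()) {
          batch.add(r);
        }
        if (batch.size() >= 1000) { // flush threshold is illustrative
          out.collect(batch);       // downstream would write one hudi commit
          buffer.clear();
        }
      }
    }
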
> > > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <yanghua1...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > After doing some research, let me share my findings:
> > > > > > > > >
> > > > > > > > >    - Limitation of computing engine capabilities: Hudi uses
> > > > > > > > >    Spark's RDD#persist, and Flink currently has no API to
> > > > > > > > >    cache datasets. Maybe we can only choose to use external
> > > > > > > > >    storage (sketched below) or not use a cache? For the use
> > > > > > > > >    of other APIs, the two currently offer almost equivalent
> > > > > > > > >    capabilities.
> > > > > > > > >    - The abstraction of the computing engine is different:
> > > > > > > > >    Considering the different usage scenarios of the
> > > > > > > > >    computing engine in Hudi, and since Flink has not yet
> > > > > > > > >    implemented stream/batch unification, we may use both
> > > > > > > > >    Flink's DataSet API (batch processing) and DataStream
> > > > > > > > >    API (stream processing).
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Vino
> > > > > > > > >
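
A minimal sketch of the external-storage workaround from the first bullet
above, using the Flink DataSet API; the path and records are illustrative:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.core.fs.FileSystem;

    public class ExternalCacheSketch {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> tagged = env.fromElements("k1", "k2")
            .map((MapFunction<String, String>) k -> k + ":tagged");
        // Materialize the intermediate result once to external storage...
        tagged.writeAsText("file:///tmp/tagged-records",
            FileSystem.WriteMode.OVERWRITE);
        env.execute("materialize");
        // ...then re-read it wherever Spark code would have hit the persisted RDD.
        DataSet<String> reloaded = env.readTextFile("file:///tmp/tagged-records");
        System.out.println(reloaded.count());
      }
    }
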
> > > > > > > > > nishith agarwal <n3.nas...@gmail.com> wrote on Thu, Aug 8,
> > > > > > > > > 2019 at 12:57 AM:
> > > > > > > > >
> > > > > > > > > > Nick,
> > > > > > > > > >
> > > > > > > > > > You bring up a good point about the non-trivial
> > > > > > > > > > programming model differences between these different
> > > > > > > > > > technologies. From a theoretical perspective, I'd say
> > > > > > > > > > considering a higher level abstraction makes sense. I
> > > > > > > > > > think we have to decouple some objectives and concerns
> > > > > > > > > > here.
> > > > > > > > > >
> > > > > > > > > > a) The immediate desire is to have Hudi be able to run on
> > > > > > > > > > a Flink (or non-spark) engine. This naturally begs the
> > > > > > > > > > question of decoupling Hudi concepts from direct Spark
> > > > > > > > > > dependencies.
> > > > > > > > > >
> > > > > > > > > > b) If we do want to initiate the above effort, would it
> > > > > > > > > > make sense to just have a higher level abstraction,
> > > > > > > > > > building on other technologies like beam (euphoria etc.),
> > > > > > > > > > and provide single, clean APIs that may be more
> > > > > > > > > > maintainable from a code perspective? But at the same
> > > > > > > > > > time, this will introduce challenges in maintaining
> > > > > > > > > > efficiency and optimized runtime dags for Hudi (since the
> > > > > > > > > > code would move away from point integrations, and
> > > > > > > > > > whenever this happens, tuning natively for specific
> > > > > > > > > > engines becomes more and more difficult).
> > > > > > > > > >
> > > > > > > > > > My general opinion is that, as the community grows over
> > > > > > > > > > time with more folks having an in-depth understanding of
> > > > > > > > > > Hudi, going from current_state -> (a) -> (b) might be the
> > > > > > > > > > most reliable and adoptable path for this project.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Nishith
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng
> > > > > > > > > > <n...@semanticbeeng.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > There are some non-trivial differences between the
> > > > > > > > > > > programming models and runtime semantics of Beam, Spark
> > > > > > > > > > > and Flink:
> > > > > > > > > > >
> > > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > > > > > > > > >
> > > > > > > > > > > Nishith, Vino - thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Does it make sense to consider a higher level
> > > > > > > > > > > abstraction / DSL instead of maintaining different code
> > > > > > > > > > > with the same functionality but different programming
> > > > > > > > > > > models?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > > > > > > > > > >
> > > > > > > > > > > Nick
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal
> > > > > > > > > > > <n3.nas...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > +1 for Approach 1: point integration with each framework.
> framework.
> > > > > > > > > > >
> > > > > > > > > > > Pros for point integration:
> > > > > > > > > > >
> > > > > > > > > > >    - Hudi community is already familiar with spark and
> > > > > > > > > > >    spark based actions/shuffles etc. Since both modules
> > > > > > > > > > >    can be decoupled, this enables us to have a steady
> > > > > > > > > > >    release for Hudi for 1 execution engine (spark)
> > > > > > > > > > >    while we hone our skills and iterate on making the
> > > > > > > > > > >    flink dag optimized and performant with the right
> > > > > > > > > > >    configuration.
> > > > > > > > > > >    - This might be a stepping stone towards rewriting
> > > > > > > > > > >    the entire code base to be agnostic of spark/flink.
> > > > > > > > > > >    This approach will help us fix tests and
> > > > > > > > > > >    intricacies, and help make the code base ready for a
> > > > > > > > > > >    larger rework.
> > > > > > > > > > >    - Seems like the easiest way to add flink support.
> > > > > > > > > > >
> > > > > > > > > > > Cons:
> > > > > > > > > > >
> > > > > > > > > > >    - More code paths to maintain and reason about,
> > > > > > > > > > >    since the spark and flink integrations will
> > > > > > > > > > >    naturally diverge over time.
> > > > > > > > > > >
> > > > > > > > > > > Theoretically, I do like the idea of being able to run
> > > > > > > > > > > the hudi dag on beam more than point integrations,
> > > > > > > > > > > where there is one API/logic to reason about. But
> > > > > > > > > > > practically, that may not be the right direction.
> > > > > > > > > > >
> > > > > > > > > > > Pros:
> > > > > > > > > > >
> > > > > > > > > > >    - Lesser cognitive burden in maintaining, evolving
> > > > > > > > > > >    and releasing the project with one API to reason
> > > > > > > > > > >    with.
> > > > > > > > > > >    - Theoretically, going forward, assuming beam is
> > > > > > > > > > >    adopted as a standard programming paradigm for
> > > > > > > > > > >    stream/batch, this would enable consumers to
> > > > > > > > > > >    leverage the power of hudi more easily.
> > > > > > > > > > >
> > > > > > > > > > > Cons:
> > > > > > > > > > >
> > > > > > > > > > >    - Massive rewrite of the code base. Additionally,
> > > > > > > > > > >    since we would have moved away from directly using
> > > > > > > > > > >    spark APIs, there is a bigger risk of regression. We
> > > > > > > > > > >    would have to be very thorough with all the
> > > > > > > > > > >    intricacies and ensure the same stability of new
> > > > > > > > > > >    releases.
> > > > > > > > > > >    - Managing future features (which may be very spark
> > > > > > > > > > >    driven) will either clash or pause or will need to
> > > > > > > > > > >    be reworked.
> > > > > > > > > > >    - Tuning jobs for Spark/Flink type execution
> > > > > > > > > > >    frameworks individually might be difficult and will
> > > > > > > > > > >    get more difficult over time as the project evolves,
> > > > > > > > > > >    where some beam integrations with spark/flink may
> > > > > > > > > > >    not work as expected.
> > > > > > > > > > >    - Also, as pointed out above, we probably need to
> > > > > > > > > > >    support the hoodie-spark module as first-class.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Nishith
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala
> > > > > > > > > > > <taher...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Vinoth,
> > > > > > > > > > > Are there some tasks I can take up to ramp up on the
> > > > > > > > > > > code? I want to get more used to the code and
> > > > > > > > > > > understand the existing implementation better.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Taher Koitawala
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar
> > > > > > > > > > > <vin...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Let's see if others have any thoughts as well. We can
> > > > > > > > > > > plan to fix the approach by EOW.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang
> > > > > > > > > > > <yanghua1...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi guys,
> > > > > > > > > > >
> > > > > > > > > > > Also, +1 for Approach 1, like Taher.
> > > > > > > > > > >
> > > > > > > > > > > > If we can do a comprehensive analysis of this model
> > > > > > > > > > > > and come up with means to refactor this cleanly,
> > > > > > > > > > > > this would be promising.
> > > > > > > > > > >
> > > > > > > > > > > Yes, when we reach a conclusion, we could start this
> > > > > > > > > > > work.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Vino
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug
> > > > > > > > > > > 6, 2019 at 12:28 AM:
> > > > > > > > > > >
> > > > > > > > > > > +1 for Approach 1: point integration with each
> > > > > > > > > > > framework.
> > > > > > > > > > >
> > > > > > > > > > > Approach 2 has a problem, as you said: "Developers need
> > > > > > > > > > > to think about
> > > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So
> > > > > > > > > > > in the end, this may not be the panacea that it seems
> > > > > > > > > > > to be."
> > > > > > > > > > >
> > > > > > > > > > > We have seen various pipelines in the beam dag being
> > > > > > > > > > > expressed differently than we had them in our original
> > > > > > > > > > > use case. Also, switching between the spark and Flink
> > > > > > > > > > > runners in beam has various impacts on the pipelines,
> > > > > > > > > > > like some features available in Flink not being
> > > > > > > > > > > available on the spark runner, etc.
> > > > > > > > > > >
> > > > > > > > > > > Refer to this compatibility matrix ->
> > > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > > > > > > > >
> > > > > > > > > > > Hence my vote is on Approach 1: let's decouple and
> > > > > > > > > > > build the abstraction for each framework. That is a
> > > > > > > > > > > much better option. We will also have more control
> > > > > > > > > > > over each framework's implementation.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar
> > > > > > > > > > > <vin...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Would like to highlight that there are two distinct
> > > > > > > > > > > approaches here with different tradeoffs. Think of this
> > > > > > > > > > > as my braindump, as I have been thinking about this
> > > > > > > > > > > quite a bit in the past.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *Approach 1: Point integration with each framework*
> > > > > > > > > > >
> > > > > > > > > > > >> We may need a pure client module named, for example,
> > > > > > > > > > > >> hoodie-client-core (common)
> > > > > > > > > > > >> Then we could have: hoodie-client-spark,
> > > > > > > > > > > >> hoodie-client-flink and hoodie-client-beam
> > > > > > > > > > >
> > > > > > > > > > > (+) This is the safest to do IMO, since we can isolate
> > > > > > > > > > > the current Spark execution (hoodie-spark,
> > > > > > > > > > > hoodie-client-spark) from the changes for flink, while
> > > > > > > > > > > it stabilizes over a few releases.
> > > > > > > > > > > (-) Downside is that the utilities need to be redone:
> > > > > > > > > > > hoodie-utilities-spark and hoodie-utilities-flink and
> > > > > > > > > > > hoodie-utilities-core? hoodie-cli?
> > > > > > > > > > >
> > > > > > > > > > > If we can do a comprehensive analysis of this model and
> > > > > > > > > > > come up with means to refactor this cleanly, this would
> > > > > > > > > > > be promising (a sketch of such a seam follows below).
> > > > > > > > > > >
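
To make this concrete, a minimal sketch of what the seam in hoodie-client-core
could look like; the interface and method names are purely illustrative, not a
committed design:

    import java.io.Serializable;
    import java.util.List;
    import java.util.function.Function;

    // Hypothetical engine-agnostic handle on a distributed collection.
    // hoodie-client-spark would back it with a JavaRDD, hoodie-client-flink
    // with a DataSet/DataStream, and hoodie-client-beam with a PCollection.
    public interface HoodieEngineData<T> extends Serializable {
      <R> HoodieEngineData<R> map(Function<T, R> fn);
      // persist() on Spark; external storage or state on engines without a cache.
      HoodieEngineData<T> cacheOrMaterialize();
      List<T> collectAsList();
    }

hoodie-client-core would then be written against such an interface, while each
engine module ships its own implementation plus whatever tuning hooks (custom
partitioners, caching) its engine natively supports.
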
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *Approach 2: Beam as the compute abstraction*
> > > > > > > > > > >
> > > > > > > > > > > Another, more drastic approach is to remove Spark as
> > > > > > > > > > > the compute abstraction for writing data and replace it
> > > > > > > > > > > with Beam.
> > > > > > > > > > >
> > > > > > > > > > > (+) All of the code remains more or less similar and
> > > > > > > > > > > there is one compute API to reason about.
> > > > > > > > > > >
> > > > > > > > > > > (-) The (very big) assumption here is that we are able
> > > > > > > > > > > to tune the spark runtime the same way using Beam:
> > > > > > > > > > > custom partitioners, support for all the RDD operations
> > > > > > > > > > > we invoke, caching, etc.
> > > > > > > > > > > (-) It will be a massive rewrite, and testing such a
> > > > > > > > > > > large rewrite would also be really challenging, since
> > > > > > > > > > > we need to pay attention to all the intricate details
> > > > > > > > > > > to ensure that today's spark users experience no
> > > > > > > > > > > regressions/side-effects.
> > > > > > > > > > > (-) Note that we still probably need to support the
> > > > > > > > > > > hoodie-spark module, and maybe a first-class such
> > > > > > > > > > > integration with flink, for native flink/spark pipeline
> > > > > > > > > > > authoring. Users of, say, DeltaStreamer need to pass in
> > > > > > > > > > > Spark or Flink configs anyway.. Developers need to
> > > > > > > > > > > think about
> > > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So
> > > > > > > > > > > in the end, this may not be the panacea that it seems
> > > > > > > > > > > to be.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
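
For reference, a minimal Beam sketch of one of the patterns in question (a
countByKey-style aggregation); note that Beam exposes no user-facing
persist()/cache(), which is exactly the tuning gap flagged above. The input
values are illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;

    public class BeamCountSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // Rough equivalent of Spark's countByKey over hudi partition paths.
        p.apply(Create.of("2019/08/01", "2019/08/02", "2019/08/01"))
            .apply(Count.perElement());
        // The runner (spark/flink/direct) is chosen via pipeline options,
        // so engine-specific tuning happens outside this code.
        p.run().waitUntilFinish();
      }
    }
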
> > > > > > > > > > >
> > > > > > > > > > > One goal for the HIP is to get us all to agree as a
> > > > > > > > > > > community on which one to pick, with sufficient
> > > > > > > > > > > investigation, testing, and benchmarking..
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang
> > > > > > > > > > > <yanghua1...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > +1 for both Beam and Flink.
> > > > > > > > > > >
> > > > > > > > > > > > First step here is to probably draw out the current
> > > > > > > > > > > > hierarchy and figure out what the abstraction points
> > > > > > > > > > > > are.. In my opinion, the runtime (spark, flink)
> > > > > > > > > > > > should be done at the hoodie-client level and just
> > > > > > > > > > > > used by hoodie-utilities seamlessly..
> > > > > > > > > > >
> > > > > > > > > > > +1 for Vinoth's opinion; it should be the first step.
> > > > > > > > > > >
> > > > > > > > > > > No matter which computing framework we hope Hudi will
> > > > > > > > > > > integrate with, we need to decouple the Hudi client
> > > > > > > > > > > and Spark.
> > > > > > > > > > >
> > > > > > > > > > > We may need a pure client module named, for example,
> > > > > > > > > > > hoodie-client-core (common)
> > > > > > > > > > >
> > > > > > > > > > > Then we could have: hoodie-client-spark,
> > > > > > > > > > > hoodie-client-flink and hoodie-client-beam
> > > > > > > > > > >
> > > > > > > > > > > Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4,
> > > > > > > > > > > 2019 at 10:45 AM:
> > > > > > > > > > >
> > > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala
> > > > > > > > > > > <taher...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > So the way to go about this is to file a HIP, chalk
> > > > > > > > > > > out all the classes, and start moving towards a pure
> > > > > > > > > > > client.
> > > > > > > > > > >
> > > > > > > > > > > Secondly, should we want to try beam?
> > > > > > > > > > >
> > > > > > > > > > > I think there is too much going on here and I'm not
> > > > > > > > > > > able to follow. If we want to try out beam all along,
> > > > > > > > > > > I don't think it makes sense to do anything on Flink
> > > > > > > > > > > then.
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng
> > > > > > > > > > > <n...@semanticbeeng.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >> +1 My money is on this approach.
> > > > > > > > > > > >>
> > > > > > > > > > > >> The existing abstractions from Beam seem enough for
> > > > > > > > > > > the use cases as I imagine them.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Flink also has "dynamic table", "table source" and
> > > > > > > > > > > "table sink", which seem like very useful abstractions
> > > > > > > > > > > where Hudi might fit nicely:
> > > > > > > > > > > >>
> > > > > > > > > > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> Attached a screenshot.
> > > > > > > > > > > >>
> > > > > > > > > > > >> This seems to fit with the original premise of Hudi
> > > > > > > > > > > as well.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Am exploring this avenue with a use case that
> > > > > > > > > > > involves "temporal joins on streams", which I need for
> > > > > > > > > > > feature extraction.
> > > > > > > > > > > >>
> > > > > > > > > > > >> If anyone is interested in this or has concrete
> > > > > > > > > > > enough needs and use cases, please let me know.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Best to go from an agreed upon set of 2-3 use cases.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Cheers
> > > > > > > > > > > >>
> > > > > > > > > > > >> Nick
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Also, we do have some Beam experts on the mailing
> > > > > > > > > > > list.. Can you please weigh in on the viability of
> > > > > > > > > > > using Beam as the intermediate abstraction here between
> > > > > > > > > > > Spark/Flink? Hudi uses RDD APIs like groupBy,
> > > > > > > > > > > mapToPair, sortAndRepartition, reduceByKey, countByKey
> > > > > > > > > > > and also does custom partitioning a lot.
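
A minimal sketch of a few of those RDD patterns, including a custom
partitioner; the keys and bucket count are illustrative, not Hudi's actual
logic:

    import java.util.Arrays;
    import java.util.Map;
    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class RddPatternsSketch {
      public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "rdd-patterns");
        JavaPairRDD<String, Integer> pairs = jsc
            .parallelize(Arrays.asList("2019/08/01", "2019/08/02", "2019/08/01"))
            .mapToPair(p -> new Tuple2<>(p, 1));
        Map<String, Long> counts = pairs.countByKey();     // countByKey
        JavaPairRDD<String, Integer> sized =
            pairs.reduceByKey(Integer::sum);               // reduceByKey
        // Custom partitioning: route each partition path to a fixed bucket.
        JavaPairRDD<String, Integer> routed = sized.partitionBy(new Partitioner() {
          @Override public int numPartitions() { return 4; }
          @Override public int getPartition(Object key) {
            return Math.abs(key.hashCode()) % numPartitions();
          }
        });
        System.out.println(counts + " -> " + routed.collect());
        jsc.stop();
      }
    }
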
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>  
