Hi Vino,
      This is not a design for Hudi on Flink. It was simply a mock-up of
moving the tagLocation() Spark cache into Flink state, which Vinoth wanted
to see.
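
To make the idea concrete, the core of the mock-up looks roughly like the
sketch below. It is illustrative only: IncomingRecord is a stand-in for a
Hudi record rather than a real Hudi class, and writing the assigned file id
back into state after a successful commit is left out.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Illustrative stand-in for a Hudi record: key, partition path, file id. */
class IncomingRecord {
    String recordKey;
    String partitionPath;
    String taggedFileId; // stays null => treated as an insert
}

/** Keyed by recordKey; mimics tagLocation() by looking up the file id the
 *  key was last written to. */
class TagLocation
        extends KeyedProcessFunction<String, IncomingRecord, IncomingRecord> {

    private transient ValueState<String> knownFileId;

    @Override
    public void open(Configuration conf) {
        knownFileId = getRuntimeContext().getState(
                new ValueStateDescriptor<>("known-file-id", String.class));
    }

    @Override
    public void processElement(IncomingRecord rec, Context ctx,
                               Collector<IncomingRecord> out) throws Exception {
        String fileId = knownFileId.value();
        if (fileId != null) {
            rec.taggedFileId = fileId; // seen before -> tag as an update
        }
        out.collect(rec);
        // After a successful Hudi commit, the file id assigned by the writer
        // would be stored back via knownFileId.update(...); omitted here.
    }
}

Keying the stream by record key turns the Spark-side cache lookup into a
local Flink state lookup, and the configured state backend decides whether
that state lives on heap or in RocksDB.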

Regarding Flink batch and streaming: I am well aware of Flink's
batch/stream unification efforts. However, that work is still in progress,
so for now let's stick with Spark for batch on Hudi and use Flink only for
streaming.



Regards,
Taher Koitawala

On Wed, Sep 25, 2019, 7:52 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Taher,
>
> As I mentioned in the previous mail, things may not be too easy with just
> the Flink state API.
>
> Copied here: "Hudi can connect with many different Source/Sinks. Some
> file-based reads are not appropriate for Flink Streaming."
>
> Although unifying batch and streaming is Flink's goal, it is difficult to
> ignore the Flink Batch API when matching some features Hudi provides now.
>
> The example you provided is at the application layer, about usage. So my
> suggestion is to be patient; it takes time to produce a detailed design.
>
> Best,
> Vino
>
>
>
> On 09/24/2019 22:38, Taher Koitawala <taher...@gmail.com> wrote:
> Hi All,
>              Sample code showing how record tagging can be handled in
> Flink is posted at [1]. The main class to run it is MockHudi.java, with a
> sample path for checkpointing.
>
> As of now this is just a sample to show that we can be caching in Flink
> state with bare-minimum configs.
>
>
> In my experience, I have cached tens of TBs in Flink RocksDB state with
> the right configs. So I'm sure it should work here as well.
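>
> A rough example of the kind of setup I mean (just a sketch; the checkpoint
> path and TTL values below are placeholders, not the actual configs I used):
>
> import org.apache.flink.api.common.state.StateTtlConfig;
> import org.apache.flink.api.common.state.ValueStateDescriptor;
> import org.apache.flink.api.common.time.Time;
> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class RocksDbIndexStateSetup {
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env =
>                 StreamExecutionEnvironment.getExecutionEnvironment();
>         // Spill state to RocksDB, checkpointing incrementally to a DFS path.
>         env.setStateBackend(
>                 new RocksDBStateBackend("hdfs:///tmp/flink-checkpoints", true));
>
>         // Expire index entries not read or written for 7 days.
>         StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(7))
>                 .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
>                 .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
>                 .cleanupInRocksdbCompactFilter()
>                 .build();
>
>         // Descriptor for the key -> location state used by the tagging step.
>         ValueStateDescriptor<String> indexState =
>                 new ValueStateDescriptor<>("record-location", String.class);
>         indexState.enableTimeToLive(ttl);
>     }
> }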
>
> 1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
>
> Regards,
> Taher Koitawala
>
>
> On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > It won't be much different than the HBaseIndex we have today. Would
> > like to always have an option like BloomIndex that does not need any
> > external dependencies.
> > The moment you bring an external data store in, someone becomes a DBA.
> > :)
> >
> > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <n...@semanticbeeng.com>
> > wrote:
> >
> > > @vc can you see how Apache Crail could be used to implement this at
> > > scale, but also in a way that abstracts over both Spark and Flink?
> > >
> > > "Crail Store implements a hierarchical namespace across a cluster of
> RDMA
> > > interconnected storage resources such as DRAM or flash"
> > >
> > > https://crail.incubator.apache.org/overview/
> > >
> > > + 2 cents
> > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> > >
> > > Cheers
> > >
> > > Nick
> > >
> > > On September 22, 2019 at 9:28 AM Vinoth Chandar <vin...@apache.org>
> > wrote:
> > >
> > >
> > > It could be much larger. :) Imagine billions of keys, each 32 bytes,
> > > mapped to another 32 bytes.
> > >
> > > The advantage of the current bloom index is that it's effectively
> > > stored with the data itself, and this reduces complexity in terms of
> > > keeping index and data consistent, etc.
> > >
> > > One orthogonal idea from a long time ago that moves indexing out of
> > > data storage and is generalizable:
> > >
> > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> > >
> > > If someone here knows Flink well and can implement some standalone
> > > Flink code to mimic tagLocation() functionality and share it with the
> > > group, that would be great. Let's worry about performance once we have
> > > a Flink DAG. I think this is the most critical and tricky piece in
> > > supporting Flink.
> > >
> > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <vinay18.pa...@gmail.com>
> > > wrote:
> > >
> > > Hi Taher,
> > >
> > > I agree with this; if the state becomes too large, we should have the
> > > option of storing it in an external state backend like a file system
> > > or RocksDB.
> > >
> > > @Vinoth Chandar <vin...@apache.org> can the state of HoodieBloomIndex
> > > go beyond 10-15 GB?
> > >
> > > Regards,
> > > Vinay Patil
> > >
> > > >
> > >
> > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <taher...@gmail.com>
> > > wrote:
> > >
> > > >> Hey guys, any thoughts on the above idea? To handle
> > > >> HoodieBloomIndex with HeapState, RocksDBState and FsState, but on
> > > >> Spark.
> > > >>
> > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <taher...@gmail.com> wrote:
> > > >>
> > > >> > Hi Vinoth,
> > > >> > Having seen the doc and code, I understand that HoodieBloomIndex
> > > >> > mainly caches the record key and partition path. Can we address
> > > >> > this the way Flink does? Like having HeapState, where the user
> > > >> > chooses to cache the index on heap; RocksDBState, where indexes
> > > >> > are written to RocksDB; and finally FsState, where indexes can be
> > > >> > written to HDFS, S3, or Azure FS. And on top, we can apply an
> > > >> > index time-to-live.
> > > >> >
> > > >> > Regards,
> > > >> > Taher Koitawala
> > > >> >
> > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > >> >
> > > >> >> I still feel the key thing here is reimplementing
> > > >> >> HoodieBloomIndex without needing Spark caching.
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
> > > >> >> documents the Spark DAG in detail.
> > > >> >>
> > > >> >> If everyone feels it's best for me to scope the work out, then
> > > >> >> happy to do it!
> > > >> >>
> > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com> wrote:
> > > >> >>
> > > >> >> > Guys, I think we are slowing down on this again. We need to
> > > >> >> > start planning small tasks towards this. VC, please can you
> > > >> >> > help fast-track this?
> > > >> >> >
> > > >> >> > Regards,
> > > >> >> > Taher Koitawala
> > > >> >> >
> > > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > >> >> >
> > > >> >> > > Looking forward to the analysis. A key class to read would
> > > >> >> > > be HoodieBloomIndex, which uses a lot of Spark caching and
> > > >> >> > > shuffles.
> > > >> >> > >
> > > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com> wrote:
> > > >> >> > >
> > > >> >> > > > >> Currently Spark Streaming micro-batching fits well with
> > > >> >> > > > >> Hudi, since it amortizes the cost of indexing, workload
> > > >> >> > > > >> profiling, etc.: 1 Spark micro-batch = 1 Hudi commit.
> > > >> >> > > > >> With the per-record model in Flink, I am not sure how
> > > >> >> > > > >> useful it will be to support Hudi. For e.g., 1 input
> > > >> >> > > > >> record cannot be 1 Hudi commit; it would be inefficient.
> > > >> >> > > >
> > > >> >> > > > Yes, if 1 input record = 1 Hudi commit, it would be
> > > >> >> > > > inefficient. With Flink streaming, we can also implement a
> > > >> >> > > > "batch" or "micro-batch" model when processing data. For
> > > >> >> > > > example (see the sketch below):
> > > >> >> > > >
> > > >> >> > > > - aggregation: use the flexible window mechanism;
> > > >> >> > > > - non-aggregation: use the Flink stateful state API to
> > > >> >> > > >   cache a batch of data
> > > >> >> > > >
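> > > >> >> > > > A minimal sketch of the non-aggregation case (illustrative
> > > >> >> > > > only: it buffers plain strings, and the one-minute interval
> > > >> >> > > > is a placeholder):
> > > >> >> > > >
> > > >> >> > > > import java.util.ArrayList;
> > > >> >> > > > import java.util.List;
> > > >> >> > > > import org.apache.flink.api.common.state.ListState;
> > > >> >> > > > import org.apache.flink.api.common.state.ListStateDescriptor;
> > > >> >> > > > import org.apache.flink.configuration.Configuration;
> > > >> >> > > > import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
> > > >> >> > > > import org.apache.flink.util.Collector;
> > > >> >> > > >
> > > >> >> > > > // Buffers records in keyed state and flushes them once per
> > > >> >> > > > // interval, so one flushed batch can map to one Hudi commit.
> > > >> >> > > > public class MicroBatcher
> > > >> >> > > >         extends KeyedProcessFunction<String, String, List<String>> {
> > > >> >> > > >
> > > >> >> > > >     private static final long INTERVAL_MS = 60_000;
> > > >> >> > > >
> > > >> >> > > >     private transient ListState<String> buffer;
> > > >> >> > > >
> > > >> >> > > >     @Override
> > > >> >> > > >     public void open(Configuration conf) {
> > > >> >> > > >         buffer = getRuntimeContext().getListState(
> > > >> >> > > >                 new ListStateDescriptor<>("pending", String.class));
> > > >> >> > > >     }
> > > >> >> > > >
> > > >> >> > > >     @Override
> > > >> >> > > >     public void processElement(String record, Context ctx,
> > > >> >> > > >             Collector<List<String>> out) throws Exception {
> > > >> >> > > >         buffer.add(record);
> > > >> >> > > >         // Registering the same timestamp twice is a no-op, so
> > > >> >> > > >         // this creates at most one timer per interval.
> > > >> >> > > >         long now = ctx.timerService().currentProcessingTime();
> > > >> >> > > >         ctx.timerService().registerProcessingTimeTimer(
> > > >> >> > > >                 now - (now % INTERVAL_MS) + INTERVAL_MS);
> > > >> >> > > >     }
> > > >> >> > > >
> > > >> >> > > >     @Override
> > > >> >> > > >     public void onTimer(long ts, OnTimerContext ctx,
> > > >> >> > > >             Collector<List<String>> out) throws Exception {
> > > >> >> > > >         List<String> batch = new ArrayList<>();
> > > >> >> > > >         for (String r : buffer.get()) {
> > > >> >> > > >             batch.add(r);
> > > >> >> > > >         }
> > > >> >> > > >         buffer.clear();
> > > >> >> > > >         if (!batch.isEmpty()) {
> > > >> >> > > >             out.collect(batch); // 1 batch = 1 Hudi commit
> > > >> >> > > >         }
> > > >> >> > > >     }
> > > >> >> > > > }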
> > > >> >> > > >
> > > >> >> > > > >> On first focussing on decoupling of Spark and Hudi
> > > >> >> > > > >> alone, yes, a full summary of how Spark is being used,
> > > >> >> > > > >> in a wiki page, is a good start IMO. We can then hash
> > > >> >> > > > >> out what can be generalized and what cannot and needs to
> > > >> >> > > > >> be left in hudi-client-spark vs hudi-client-core.
> > > >> >> > > >
> > > >> >> > > > agree
> > > >> >> > > >
> > > >> >> > > > On Wed, Aug 14, 2019 at 8:35 AM Vinoth Chandar <vin...@apache.org> wrote:
> > > >> >> > > >
> > > >> >> > > > > >> We should only stick to Flink Streaming. Furthermore,
> > > >> >> > > > > >> if there is a requirement for batch, then users should
> > > >> >> > > > > >> use Spark, or we will anyway have a Beam integration
> > > >> >> > > > > >> coming up.
> > > >> >> > > > >
> > > >> >> > > > > Currently Spark Streaming micro-batching fits well with
> > > >> >> > > > > Hudi, since it amortizes the cost of indexing, workload
> > > >> >> > > > > profiling, etc.: 1 Spark micro-batch = 1 Hudi commit.
> > > >> >> > > > > With the per-record model in Flink, I am not sure how
> > > >> >> > > > > useful it will be to support Hudi. For e.g., 1 input
> > > >> >> > > > > record cannot be 1 Hudi commit; it would be inefficient.
> > > >> >> > > > >
> > > >> >> > > > > On first focussing on decoupling of Spark and Hudi
> > > >> >> > > > > alone, yes, a full summary of how Spark is being used, in
> > > >> >> > > > > a wiki page, is a good start IMO. We can then hash out
> > > >> >> > > > > what can be generalized and what cannot and needs to be
> > > >> >> > > > > left in hudi-client-spark vs hudi-client-core.
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > > >
> > > >> >> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <yanghua1...@gmail.com> wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Hi Nick and Taher,
> > > >> >> > > > > >
> > > >> >> > > > > > I just want to answer Nishith's question, referencing
> > > >> >> > > > > > his old description here:
> > > >> >> > > > > >
> > > >> >> > > > > > > You can do a parallel investigation while we are
> > > >> >> > > > > > > deciding on the module structure. You could be looking
> > > >> >> > > > > > > at all the patterns in Hudi's Spark API usage
> > > >> >> > > > > > > (RDD/DataSource/SparkContext) and see if such support
> > > >> >> > > > > > > can be achieved in theory with Flink; if not, what is
> > > >> >> > > > > > > the workaround. Documenting such patterns would be
> > > >> >> > > > > > > valuable when multiple engineers are working on it.
> > > >> >> > > > > > > For e.g., Hudi relies on (a) custom partitioning logic
> > > >> >> > > > > > > for upserts, (b) caching RDDs to avoid reruns of
> > > >> >> > > > > > > costly stages, and (c) a Spark upsert task knowing its
> > > >> >> > > > > > > Spark partition/task/attempt ids.
> > > >> >> > > > > >
> > > >> >> > > > > > And just like the title of this thread, we are going to
> > > >> >> > > > > > try to decouple Hudi and Spark. That means we should be
> > > >> >> > > > > > able to run the whole of Hudi without depending on
> > > >> >> > > > > > Spark. So we need to analyze all the usage of Spark in
> > > >> >> > > > > > Hudi.
> > > >> >> > > > > >
> > > >> >> > > > > > Here we are not discussing the integration of Hudi and
> > > >> >> > > > > > Flink at the application layer. Instead, I want Hudi to
> > > >> >> > > > > > be decoupled from Spark, allowing other engines (such as
> > > >> >> > > > > > Flink) to replace Spark.
> > > >> >> > > > > >
> > > >> >> > > > > > It can be divided into long-term and short-term goals,
> > > >> >> > > > > > as Nishith stated in a recent email.
> > > >> >> > > > > >
> > > >> >> > > > > > I mentioned the Flink Batch API here because Hudi can
> > > >> >> > > > > > connect with many different sources/sinks. Some
> > > >> >> > > > > > file-based reads are not appropriate for Flink
> > > >> >> > > > > > Streaming.
> > > >> >> > > > > >
> > > >> >> > > > > > Therefore, what we need is a comprehensive survey of
> > > >> >> > > > > > the use of Spark in Hudi.
> > > >> >> > > > > >
> > > >> >> > > > > > Best,
> > > >> >> > > > > > Vino
> > > >> >> > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > > On Tue, Aug 13, 2019 at 5:43 PM taher koitawala <taher...@gmail.com> wrote:
> > > >> >> > > > > >
> > > >> >> > > > > > > Hi Vino,
> > > >> >> > > > > > > According to what I've seen, Hudi has a lot of Spark
> > > >> >> > > > > > > components flowing through it, like TaskContexts,
> > > >> >> > > > > > > JavaSparkContexts, etc. The main classes I guess we
> > > >> >> > > > > > > should focus on are HoodieTable and the Hoodie write
> > > >> >> > > > > > > clients.
> > > >> >> > > > > > >
> > > >> >> > > > > > > Also, Vino, I don't think we should be providing a
> > > >> >> > > > > > > Flink DataSet implementation. We should only stick to
> > > >> >> > > > > > > Flink Streaming. Furthermore, if there is a
> > > >> >> > > > > > > requirement for batch, then users should use Spark, or
> > > >> >> > > > > > > we will anyway have a Beam integration coming up.
> > > >> >> > > > > > >
> > > >> >> > > > > > > As for the cache, how about we write our own stateful
> > > >> >> > > > > > > Flink function and use the RocksDBStateBackend with
> > > >> >> > > > > > > some state TTL?
> > > >> >> > > > > > >
> > > >> >> > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <yanghua1...@gmail.com> wrote:
> > > >> >> > > > > > >
> > > >> >> > > > > > > > Hi all,
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > After doing some research, let me share my findings:
> > > >> >> > > > > > > >
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > - Limitation of computing engine capabilities: Hudi
> > > >> >> > > > > > > >   uses Spark's RDD#persist, and Flink currently has
> > > >> >> > > > > > > >   no API to cache datasets. Maybe we can only choose
> > > >> >> > > > > > > >   to use external storage, or not use a cache? For
> > > >> >> > > > > > > >   the use of other APIs, the two currently offer
> > > >> >> > > > > > > >   almost equivalent capabilities.
> > > >> >> > > > > > > > - The abstraction of the computing engine is
> > > >> >> > > > > > > >   different: considering the different usage
> > > >> >> > > > > > > >   scenarios of the computing engine in Hudi, and
> > > >> >> > > > > > > >   that Flink has not yet implemented stream/batch
> > > >> >> > > > > > > >   unification, we may need to use both Flink's
> > > >> >> > > > > > > >   DataSet API (batch processing) and DataStream API
> > > >> >> > > > > > > >   (stream processing).
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Best,
> > > >> >> > > > > > > > Vino
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > On Thu, Aug 8, 2019 at 12:57 AM nishith agarwal <n3.nas...@gmail.com> wrote:
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > > Nick,
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > You bring up a good point about the non-trivial
> > > >> >> > > > > > > > > programming model differences between these
> > > >> >> > > > > > > > > technologies. From a theoretical perspective, I'd
> > > >> >> > > > > > > > > say considering a higher-level abstraction makes
> > > >> >> > > > > > > > > sense. I think we have to decouple some objectives
> > > >> >> > > > > > > > > and concerns here.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > a) The immediate desire is to have Hudi be able
> > > >> >> > > > > > > > > to run on a Flink (or non-Spark) engine. This
> > > >> >> > > > > > > > > naturally begs the question of decoupling Hudi
> > > >> >> > > > > > > > > concepts from direct Spark dependencies.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > b) If we do want to initiate the above effort,
> > > >> >> > > > > > > > > would it make sense to just have a higher-level
> > > >> >> > > > > > > > > abstraction, building on other technologies like
> > > >> >> > > > > > > > > Beam (Euphoria etc.), and provide single, clean
> > > >> >> > > > > > > > > APIs that may be more maintainable from a code
> > > >> >> > > > > > > > > perspective? But at the same time this will
> > > >> >> > > > > > > > > introduce challenges in maintaining efficient,
> > > >> >> > > > > > > > > optimized runtime DAGs for Hudi (since the code
> > > >> >> > > > > > > > > would move away from point integrations, and
> > > >> >> > > > > > > > > whenever this happens, tuning natively for
> > > >> >> > > > > > > > > specific engines becomes more and more difficult).
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > My general opinion is that, as the community
> > > >> >> > > > > > > > > grows over time with more folks having an
> > > >> >> > > > > > > > > in-depth understanding of Hudi, going from
> > > >> >> > > > > > > > > current_state -> (a) -> (b) might be the most
> > > >> >> > > > > > > > > reliable and adoptable path for this project.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > Thanks,
> > > >> >> > > > > > > > > Nishith
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > > There are some non-trivial differences in
> > > >> >> > > > > > > > > > programming model and runtime semantics between
> > > >> >> > > > > > > > > > Beam, Spark and Flink.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Nishith, Vino - thoughts?
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Does it make sense to consider a higher-level
> > > >> >> > > > > > > > > > abstraction / DSL instead of maintaining
> > > >> >> > > > > > > > > > different code with the same functionality but
> > > >> >> > > > > > > > > > different programming models?
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Nick
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal <n3.nas...@gmail.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > +1 for Approach 1: point integration with each
> > > >> >> > > > > > > > > > framework.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Pros for point integration:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > - The Hudi community is already familiar with
> > > >> >> > > > > > > > > > Spark and Spark-based actions/shuffles, etc.
> > > >> >> > > > > > > > > > Since both modules can be decoupled, this
> > > >> >> > > > > > > > > > enables us to have a steady Hudi release for one
> > > >> >> > > > > > > > > > execution engine (Spark) while we hone our
> > > >> >> > > > > > > > > > skills and iterate on making the Flink DAG
> > > >> >> > > > > > > > > > optimized and performant with the right
> > > >> >> > > > > > > > > > configuration.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > - This might be a stepping stone towards
> > > >> >> > > > > > > > > > rewriting the entire code base to be agnostic of
> > > >> >> > > > > > > > > > Spark/Flink. This approach will help us fix
> > > >> >> > > > > > > > > > tests and intricacies, and help make the code
> > > >> >> > > > > > > > > > base ready for a larger rework.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > - Seems like the easiest way to add Flink
> > > >> >> > > > > > > > > > support.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Cons:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > - More code paths to maintain and reason about,
> > > >> >> > > > > > > > > > since the Spark and Flink integrations will
> > > >> >> > > > > > > > > > naturally diverge over time.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Theoretically, I do like the idea of being able
> > > >> >> > > > > > > > > > to run the Hudi DAG on Beam more than point
> > > >> >> > > > > > > > > > integrations, where there is one API/logic to
> > > >> >> > > > > > > > > > reason about. But practically, that may not be
> > > >> >> > > > > > > > > > the right direction.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Pros:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > - Less cognitive burden in maintaining, evolving
> > > >> >> > > > > > > > > > and releasing the project, with one API to
> > > >> >> > > > > > > > > > reason with.
> > > >> >> > > > > > > > > > - Theoretically, going forward, assuming Beam is
> > > >> >> > > > > > > > > > adopted as a standard programming paradigm for
> > > >> >> > > > > > > > > > stream/batch, this would enable consumers to
> > > >> >> > > > > > > > > > leverage the power of Hudi more easily.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Cons:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > - Massive rewrite of the code base. Additionally,
> > > >> >> > > > > > > > > > since we would have moved away from directly
> > > >> >> > > > > > > > > > using Spark APIs, there is a bigger risk of
> > > >> >> > > > > > > > > > regression. We would have to be very thorough
> > > >> >> > > > > > > > > > with all the intricacies and ensure the same
> > > >> >> > > > > > > > > > stability of new releases.
> > > >> >> > > > > > > > > > - Managing future features (which may be very
> > > >> >> > > > > > > > > > Spark-driven) will either clash, or pause, or
> > > >> >> > > > > > > > > > will need to be reworked.
> > > >> >> > > > > > > > > > - Tuning jobs individually for Spark/Flink-type
> > > >> >> > > > > > > > > > execution frameworks might be difficult and will
> > > >> >> > > > > > > > > > get harder over time as the project evolves,
> > > >> >> > > > > > > > > > where some Beam integrations with Spark/Flink
> > > >> >> > > > > > > > > > may not work as expected.
> > > >> >> > > > > > > > > > - Also, as pointed out above, we probably need
> > > >> >> > > > > > > > > > to support the hoodie-spark module as
> > > >> >> > > > > > > > > > first-class.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Thanks,
> > > >> >> > > > > > > > > > Nishith
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <taher...@gmail.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Hi Vinoth,
> > > >> >> > > > > > > > > > Are there some tasks I can take up to ramp up
> > > >> >> > > > > > > > > > on the code? I want to get more used to the code
> > > >> >> > > > > > > > > > and understand the existing implementation
> > > >> >> > > > > > > > > > better.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Thanks,
> > > >> >> > > > > > > > > > Taher Koitawala
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Let's see if others have any thoughts as well.
> > > >> >> > > > > > > > > > We can plan to fix the approach by EOW.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Hi guys,
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Also, +1 for Approach 1 like Taher.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > If we can do a comprehensive analysis of this
> > > >> >> > > > > > > > > > model and come up with a means to refactor this
> > > >> >> > > > > > > > > > cleanly, this would be promising.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Yes, once we reach a conclusion, we can start
> > > >> >> > > > > > > > > > this work.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Best,
> > > >> >> > > > > > > > > > Vino
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 12:28 AM taher koitawala <taher...@gmail.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > +1 for Approach 1: point integration with each
> > > >> >> > > > > > > > > > framework.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Approach 2 has a problem, as you said:
> > > >> >> > > > > > > > > > "Developers need to think about
> > > >> >> > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > >> >> > > > > > > > > > So in the end, this may not be the panacea that
> > > >> >> > > > > > > > > > it seems to be."
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > We have seen various pipelines in the Beam DAG
> > > >> >> > > > > > > > > > being expressed differently than we had them in
> > > >> >> > > > > > > > > > our original use case. Also, switching between
> > > >> >> > > > > > > > > > the Spark and Flink runners in Beam has various
> > > >> >> > > > > > > > > > impacts on the pipelines, like some features
> > > >> >> > > > > > > > > > available in Flink not being available on the
> > > >> >> > > > > > > > > > Spark runner, etc.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Refer to this compatibility matrix ->
> > > >> >> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Hence my vote is for Approach 1: let's decouple
> > > >> >> > > > > > > > > > and build the abstraction for each framework.
> > > >> >> > > > > > > > > > That is a much better option. We will also have
> > > >> >> > > > > > > > > > more control over each framework's
> > > >> >> > > > > > > > > > implementation.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Would like to highlight that there are two
> > > >> >> > > > > > > > > > distinct approaches here, with different
> > > >> >> > > > > > > > > > tradeoffs. Think of this as my braindump, as I
> > > >> >> > > > > > > > > > have been thinking about this quite a bit in the
> > > >> >> > > > > > > > > > past.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > *Approach 1: Point integration with each framework*
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >> We may need a pure client module named, for
> > > >> >> > > > > > > > > > >> example, hoodie-client-core (common). Then we
> > > >> >> > > > > > > > > > >> could have: hoodie-client-spark,
> > > >> >> > > > > > > > > > >> hoodie-client-flink and hoodie-client-beam.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > (+) This is the safest to do IMO, since we can
> > > >> >> > > > > > > > > > isolate the current Spark execution
> > > >> >> > > > > > > > > > (hoodie-spark, hoodie-client-spark) from the
> > > >> >> > > > > > > > > > changes for Flink, while it stabilizes over a
> > > >> >> > > > > > > > > > few releases.
> > > >> >> > > > > > > > > > (-) Downside is that the utilities need to be
> > > >> >> > > > > > > > > > redone: hoodie-utilities-spark and
> > > >> >> > > > > > > > > > hoodie-utilities-flink and hoodie-utilities-core?
> > > >> >> > > > > > > > > > hoodie-cli?
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > If we can do a comprehensive analysis of this
> > > >> >> > > > > > > > > > model and come up with a means to refactor this
> > > >> >> > > > > > > > > > cleanly, this would be promising.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > *Approach 2: Beam as the compute abstraction*
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Another, more drastic, approach is to remove
> > > >> >> > > > > > > > > > Spark as the compute abstraction for writing
> > > >> >> > > > > > > > > > data and replace it with Beam.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > (+) All of the code remains more or less
> > > >> >> > > > > > > > > > similar, and there is one compute API to reason
> > > >> >> > > > > > > > > > about.
> > > >> >> > > > > > > > > > (-) The (very big) assumption here is that we
> > > >> >> > > > > > > > > > are able to tune the Spark runtime the same way
> > > >> >> > > > > > > > > > using Beam: custom partitioners, support for all
> > > >> >> > > > > > > > > > the RDD operations we invoke, caching, etc.
> > > >> >> > > > > > > > > > (-) It will be a massive rewrite, and testing of
> > > >> >> > > > > > > > > > such a large rewrite would also be really
> > > >> >> > > > > > > > > > challenging, since we need to pay attention to
> > > >> >> > > > > > > > > > all the intricate details to ensure today's
> > > >> >> > > > > > > > > > Spark users experience no
> > > >> >> > > > > > > > > > regressions/side-effects.
> > > >> >> > > > > > > > > > (-) Note that we still need to probably support
> > > >> >> > > > > > > > > > the hoodie-spark module, and maybe a first-class
> > > >> >> > > > > > > > > > such integration with Flink, for native
> > > >> >> > > > > > > > > > Flink/Spark pipeline authoring. Users of, say,
> > > >> >> > > > > > > > > > DeltaStreamer need to pass in Spark or Flink
> > > >> >> > > > > > > > > > configs anyway. Developers need to think about
> > > >> >> > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > >> >> > > > > > > > > > So in the end, this may not be the panacea that
> > > >> >> > > > > > > > > > it seems to be.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > One goal for the HIP is to get us all to agree,
> > > >> >> > > > > > > > > > as a community, which one to pick, with
> > > >> >> > > > > > > > > > sufficient investigation, testing and
> > > >> >> > > > > > > > > > benchmarking.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > +1 for both Beam and Flink
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > First step here is probably to draw out the
> > > >> >> > > > > > > > > > current hierarchy and figure out what the
> > > >> >> > > > > > > > > > abstraction points are.
> > > >> >> > > > > > > > > > In my opinion, the runtime (Spark, Flink) should
> > > >> >> > > > > > > > > > be done at the hoodie-client level and just used
> > > >> >> > > > > > > > > > by hoodie-utilities seamlessly.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > +1 for Vinoth's opinion; it should be the first
> > > >> >> > > > > > > > > > step.
> > > step.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > No matter which computing framework we hope Hudi
> > > >> >> > > > > > > > > > will integrate with, we need to decouple the
> > > >> >> > > > > > > > > > Hudi client from Spark.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > We may need a pure client module named, for
> > > >> >> > > > > > > > > > example, hoodie-client-core (common).
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Then we could have: hoodie-client-spark,
> > > >> >> > > > > > > > > > hoodie-client-flink and hoodie-client-beam.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Sun, Aug 4, 2019 at 10:45 AM Suneel Marthi <smar...@apache.org> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's
> > > analysis.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > So the way to go around this is to file a HIP,
> > > >> >> > > > > > > > > > chalk out all the classes, and start moving
> > > >> >> > > > > > > > > > towards a pure client.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Secondly, should we want to try Beam?
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > I think there is too much going on here and I'm
> > > >> >> > > > > > > > > > not able to follow. If we want to try out Beam
> > > >> >> > > > > > > > > > all along, I don't think it makes sense to do
> > > >> >> > > > > > > > > > anything on Flink then.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >> +1 My money is on this approach.
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> The existing abstractions from Beam seem
> > > >> >> > > > > > > > > > >> enough for the use cases as I imagine them.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >> Flink also has "dynamic table", "table
> > > >> >> > > > > > > > > > >> source" and "table sink", which seem like
> > > >> >> > > > > > > > > > >> very useful abstractions where Hudi might fit
> > > >> >> > > > > > > > > > >> nicely.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> Attached a screen shot.
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> This seems to fit with the original premise
> > > >> >> > > > > > > > > > >> of Hudi as well.
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> Am exploring this venue with a use case that
> > > >> >> > > > > > > > > > >> involves "temporal joins on streams", which I
> > > >> >> > > > > > > > > > >> need for feature extraction.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >> If anyone is interested in this or has
> > > >> >> > > > > > > > > > >> concrete enough needs and use cases, please
> > > >> >> > > > > > > > > > >> let me know.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > >> Best to go from an agreed-upon set of 2-3
> > > >> >> > > > > > > > > > >> use cases.
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> Cheers
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> Nick
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >>
> > > >> >> > > > > > > > > > >> > Also, we do have some Beam experts on the
> > > >> >> > > > > > > > > > >> > mailing list.. Can you please weigh in on
> > > >> >> > > > > > > > > > >> > the viability of using Beam as the
> > > >> >> > > > > > > > > > >> > intermediate abstraction here between
> > > >> >> > > > > > > > > > >> > Spark/Flink? Hudi uses RDD APIs like
> > > >> >> > > > > > > > > > >> > groupBy, mapToPair, sortAndRepartition,
> > > >> >> > > > > > > > > > >> > reduceByKey, countByKey, and also does
> > > >> >> > > > > > > > > > >> > custom partitioning a lot.
