How mature is Flink batch? I am wondering if we can get by with the Flink
Batch APIs, given that Hudi itself provides micro-batching semantics
(incremental pull, upserts)?
For true streaming performance, I am not sure even the cloud stores are
ready, since none of them support append()s.
IMHO a mini/micro-batch model makes sense for the kind of space Hudi
solves.. We can keep working towards making these batches smaller.

I have some thoughts on indexing, a new way to index, that complements
BloomIndex and overcomes some of the spark caching needs.
Will write it down over the weekend and share it..

P.S: I am also wondering if we should generalize HIPs into an RFC
process [1]; that way it does not have to be a real technical proposal and
we can still have structured conversations on the future roadmap etc. and
evolve the shape together..
(Maybe this deserves its own separate DISCUSS thread)


[1] - https://en.wikipedia.org/wiki/Request_for_Comments


On Wed, Sep 25, 2019 at 9:07 PM Taher Koitawala <[email protected]> wrote:

> Hi Vino,
>          Agree with your suggestion. We all know when thought Flink is
> streaming we can control how files get rolled out through checkpointing
> configurations. Bad config and small files get rolled out. Good config and
> files are properly sized.
>
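> A rough sketch of that idea (all names here are made up, and exactly-once
> details are glossed over; a real version would flush on checkpoint
> completion rather than in snapshotState):
>
> import org.apache.flink.api.common.state.ListState;
> import org.apache.flink.api.common.state.ListStateDescriptor;
> import org.apache.flink.runtime.state.FunctionInitializationContext;
> import org.apache.flink.runtime.state.FunctionSnapshotContext;
> import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>
> import java.util.ArrayList;
> import java.util.List;
>
> // Hypothetical sketch: buffer records between checkpoints so that one
> // Flink checkpoint roughly corresponds to one Hudi commit.
> public class MicroBatchBuffer extends RichSinkFunction<String>
>         implements CheckpointedFunction {
>
>     private transient ListState<String> pending; // replayed on recovery
>     private final List<String> buffer = new ArrayList<>();
>
>     @Override
>     public void invoke(String record, Context ctx) {
>         buffer.add(record); // accumulate the current micro batch
>     }
>
>     @Override
>     public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
>         // Checkpoint barrier == micro-batch boundary: hand the whole buffer
>         // to Hudi as a single commit (assumed helper, not a real Hudi API).
>         flushAsOneHudiCommit(buffer);
>         buffer.clear();
>         pending.clear();
>     }
>
>     @Override
>     public void initializeState(FunctionInitializationContext ctx) throws Exception {
>         pending = ctx.getOperatorStateStore().getListState(
>                 new ListStateDescriptor<>("pending-records", String.class));
>         for (String r : pending.get()) {
>             buffer.add(r); // re-buffer anything that was in flight
>         }
>     }
>
>     private void flushAsOneHudiCommit(List<String> records) {
>         // placeholder: write the batch through a Hudi write client here
>     }
> }
>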
>      Also, I understand the concern about reading files with Flink and
> the performance related to it (I have faced it before). So how about we
> build our own functions which can read and write efficiently and are not
> source functions but operators! What I mean is, let's use the Akka-based
> semantics of passing messages between these operators and read only what is
> required.
>
> Have one custom stream operator that takes requests for reading files
> (again, that is not a source function, it is an operator); that operator
> reads the file and passes it downstream in a parallel manner. (Maybe the
> AsyncIO extension is a better call here; a rough sketch follows below.)
> Let me know your thoughts on this, i.e., if we choose this, everything
> will be written on the core DataStream API.
>
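> Sketching the AsyncIO route (AsyncFileReader is a made-up name; the wiring
> comment shows how it would sit mid-pipeline rather than as a source):
>
> import org.apache.flink.streaming.api.datastream.AsyncDataStream;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.functions.async.ResultFuture;
> import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
>
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.TimeUnit;
>
> // Hypothetical sketch: an async "file reader" stage that is an operator,
> // not a source. Upstream sends file paths (read requests); this stage
> // reads each file off the main thread and emits its lines downstream.
> public class AsyncFileReader extends RichAsyncFunction<String, String> {
>
>     @Override
>     public void asyncInvoke(String path, ResultFuture<String> resultFuture) {
>         CompletableFuture
>                 .supplyAsync(() -> {
>                     try {
>                         // blocking read, kept off the operator's main thread
>                         return Files.readAllLines(Paths.get(path));
>                     } catch (Exception e) {
>                         throw new RuntimeException(e);
>                     }
>                 })
>                 .thenAccept(resultFuture::complete);
>     }
> }
>
> // wiring, e.g. inside the job's main():
> // DataStream<String> records = AsyncDataStream.unorderedWait(
> //         pathRequests, new AsyncFileReader(), 60_000, TimeUnit.MILLISECONDS, 10);
>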
> As far as the Flink batch vs. stream talk goes, I guess as a community we
> have already agreed that for batch the Spark engine is good enough and
> streams will be powered by Flink, whereas Beam will do both. By the time
> Flink's unification of batch and stream is complete, our code will be batch
> compatible with minimal changes.
>
> Further, we need to plan what part of the Flink API we will use: should we
> stick to the hard-core DataStream API or should we merge it with the Table
> API (I think we should) so that we get the append sink, retract sink,
> temporal tables, and the capability to use the new Blink table planner? It
> would also to some extent lift our work of reading files, since the Table
> API, I believe, is good at doing those things and comes with in-built
> readers and writers. So basically, if we use that, we could also do
> in-stream joins of the stream source and the Hudi files to recompute
> upserts etc. AFAIK the Flink Table API is also getting a .cache()-like
> functionality now. (Saw conversations about it in the mailing list.)
>
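> If we do go the Table API route, a minimal setup sketch for the Blink
> planner (Flink 1.9 APIs; the table names are assumptions and would need to
> be registered first):
>
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> import org.apache.flink.table.api.EnvironmentSettings;
> import org.apache.flink.table.api.Table;
> import org.apache.flink.table.api.java.StreamTableEnvironment;
>
> // Hypothetical sketch: stand up the Blink planner so the job can mix
> // DataStream code with Table API reads and joins.
> public class TableApiSketch {
>     public static void main(String[] args) {
>         StreamExecutionEnvironment env =
>                 StreamExecutionEnvironment.getExecutionEnvironment();
>         EnvironmentSettings settings = EnvironmentSettings.newInstance()
>                 .useBlinkPlanner()
>                 .inStreamingMode()
>                 .build();
>         StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
>
>         // e.g. join the incoming stream with the existing Hudi files to
>         // recompute upserts ("incoming" and "hudi_files" are assumptions).
>         Table upserts = tEnv.sqlQuery(
>                 "SELECT i.key, i.payload FROM incoming i JOIN hudi_files h ON i.key = h.key");
>     }
> }
>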
> So I think we really need to start planning such things to move ahead.
> Other than that, I am 100% with Vino that we cannot read files any other
> way in Flink; it would be really bad to do that.
>
> Regards,
> Taher Koitawala
>
> On Thu, Sep 26, 2019, 8:47 AM vino yang <[email protected]> wrote:
>
> > Hi
> >
> > A simple example: in the Hudi project, you can find many code snippets
> > like `spark.read().format().load()`. The load method can take any path,
> > especially DFS paths.
> >
> > Whereas if we only want to use Flink streaming, there is not a good way to
> > read from HDFS now.
> >
> > In addition, we also need to consider other capability differences between
> > Flink and Spark. You should know the Spark API (non-structured-streaming
> > mode) can support both streaming (micro-batch) and batch. However, Flink
> > distinguishes them with two different APIs, which have different feature
> > sets.
> >
> >
> >
> > On 09/25/2019 13:15, Semantic Beeng <[email protected]> wrote:
> >
> > Hi Vino,
> >
> > Would you be so kind as to start a wiki page to capture this deep
> > understanding of the functionality and design of Hudi?
> >
> > There you can put git links (https://github.com/ben-gibson/GitLink for
> > IntelliJ) and design knowledge so we can discuss in context.
> >
> > I am exploring the approach from this retweet
> > https://twitter.com/semanticbeeng/status/1176241250967666689?s=20 and
> > need this understanding you have.
> >
> > "difficult to ignore Flink Batch API to match some features provide by
> > Hudi now" - can you please post there some gitlinks to this?
> >
> > Thanks
> >
> > Nick
> >
> > On September 24, 2019 at 10:22 PM vino yang <[email protected]>
> > wrote:
> >
> > Hi Taher,
> >
> > As I mentioned in the previous mail, things may not be too easy by just
> > using the Flink state API.
> >
> > Copied here: "Hudi can connect with many different Sources/Sinks. Some
> > file-based reads are not appropriate for Flink Streaming."
> >
> > Although unifying Batch and Streaming is Flink's goal, it is difficult
> > to ignore the Flink Batch API if we want to match some features Hudi
> > provides now.
> >
> > The example you provided is at the application layer, about usage. So my
> > suggestion is to be patient; it needs time to produce a detailed design.
> >
> > Best,
> > Vino
> >
> >
> >
> > On 09/24/2019 22:38, Taher Koitawala <[email protected]> wrote:
> > Hi All,
> >              Sample code to see how record tagging will be handled in
> > Flink is posted at [1]. The main class to run it is MockHudi.java,
> > with a sample path for checkpointing.
> >
> > As of now this is just a sample to show that we should be caching in Flink
> > state with bare-minimum configs.
> >
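> > The core of it is just keyed state, roughly like this (HoodieRecordStub is
> > a stand-in POJO for the sketch, not the real Hudi class):
> >
> > import org.apache.flink.api.common.state.ValueState;
> > import org.apache.flink.api.common.state.ValueStateDescriptor;
> > import org.apache.flink.configuration.Configuration;
> > import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
> > import org.apache.flink.util.Collector;
> >
> > // Hypothetical sketch of tagLocation()-style lookups in keyed Flink state:
> > // the stream is keyed by record key, and the state remembers which fileId
> > // the key was last written to. A hit means "update", a miss means "insert".
> > public class TagLocationFunction
> >         extends KeyedProcessFunction<String, HoodieRecordStub, HoodieRecordStub> {
> >
> >     private transient ValueState<String> knownFileId;
> >
> >     @Override
> >     public void open(Configuration parameters) {
> >         knownFileId = getRuntimeContext().getState(
> >                 new ValueStateDescriptor<>("record-key-to-file-id", String.class));
> >     }
> >
> >     @Override
> >     public void processElement(HoodieRecordStub record, Context ctx,
> >                                Collector<HoodieRecordStub> out) throws Exception {
> >         String fileId = knownFileId.value();
> >         if (fileId != null) {
> >             record.currentLocation = fileId; // tagged: goes to an existing file
> >         }                                    // else: left untagged, an insert
> >         out.collect(record);
> >     }
> > }
> >
> > // stand-in record type for the sketch above
> > class HoodieRecordStub {
> >     public String key;
> >     public String partitionPath;
> >     public String currentLocation; // fileId if the key already exists
> > }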
> >
> > In my experience, I have cached tens of TBs in Flink RocksDB state
> > with the right configs, so I'm sure it should work here as well.
> >
> > 1:
> >
> >
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
> >
> > Regards,
> > Taher Koitawala
> >
> >
> > On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > It won't be much different than the HBaseIndex we have today. I would
> > > like to always have an option like BloomIndex that does not need any
> > > external dependencies.
> > > The moment you bring an external data store in, someone becomes a DBA. :)
> > >
> > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <[email protected]
> >
> > > wrote:
> > >
> > > > @vc can you see how Apache Crail could be used to implement this at
> > > > scale, but also in a way that abstracts over both Spark and Flink?
> > > >
> > > > "Crail Store implements a hierarchical namespace across a cluster of
> > RDMA
> > > > interconnected storage resources such as DRAM or flash"
> > > >
> > > > https://crail.incubator.apache.org/overview/
> > > >
> > > > + 2 cents
> > > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> > > >
> > > > Cheers
> > > >
> > > > Nick
> > > >
> > > > On September 22, 2019 at 9:28 AM Vinoth Chandar <[email protected]>
> > > wrote:
> > > >
> > > >
> > > > It could be much larger. :) Imagine billions of keys, each 32 bytes,
> > > > mapped to another 32 bytes; a billion such 64-byte entries is already
> > > > ~64 GB.
> > > >
> > > > The advantage of the current bloom index is that it is effectively
> > > > stored with the data itself, and this reduces complexity in terms of
> > > > keeping the index and data consistent etc.
> > > >
> > > > One orthogonal idea from a long time ago that moves indexing out of
> > > > data storage and is generalizable:
> > > >
> > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> > > >
> > > > If someone here knows Flink well and can implement some standalone
> > > > Flink code to mimic the tagLocation() functionality and share it with
> > > > the group, that would be great. Let's worry about performance once we
> > > > have a Flink DAG. I think this is the critical and most tricky piece
> > > > in supporting Flink.
> > > >
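> > > > To make the shape of the bloom check concrete, a tiny standalone
> > > > sketch using Guava (file names and sizes are made up; as I understand
> > > > it, Hudi keeps these filters in the parquet file footers):
> > > >
> > > > import com.google.common.hash.BloomFilter;
> > > > import com.google.common.hash.Funnels;
> > > >
> > > > import java.nio.charset.StandardCharsets;
> > > > import java.util.ArrayList;
> > > > import java.util.HashMap;
> > > > import java.util.List;
> > > > import java.util.Map;
> > > >
> > > > // One bloom filter per data file; a key only needs the expensive file
> > > > // scan when the filter says "maybe present".
> > > > public class BloomTagSketch {
> > > >     public static void main(String[] args) {
> > > >         Map<String, BloomFilter<String>> filterPerFile = new HashMap<>();
> > > >         BloomFilter<String> f = BloomFilter.create(
> > > >                 Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
> > > >         f.put("key-42");
> > > >         filterPerFile.put("file-0001", f);
> > > >
> > > >         // tagLocation for one incoming key: collect candidate files only.
> > > >         List<String> candidateFiles = new ArrayList<>();
> > > >         for (Map.Entry<String, BloomFilter<String>> e : filterPerFile.entrySet()) {
> > > >             if (e.getValue().mightContain("key-42")) {
> > > >                 candidateFiles.add(e.getKey()); // may be a false positive
> > > >             }
> > > >         }
> > > >         System.out.println("files to actually check: " + candidateFiles);
> > > >     }
> > > > }
> > > >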
> > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <[email protected]
> >
> > > > wrote:
> > > >
> > > > Hi Taher,
> > > >
> > > > I agree with this; if the state is becoming too large, we should have
> > > > an option of storing it in external state like a file system or RocksDB.
> > > >
> > > > @Vinoth Chandar <[email protected]> can the state of HoodieBloomIndex
> > > > go beyond 10-15 GB?
> > > >
> > > > Regards,
> > > > Vinay Patil
> > > >
> > > > >
> > > >
> > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <[email protected]
> >
> > > > wrote:
> > > >
> > > > >> Hey Guys, Any thoughts on the above idea? To handle
> > > > >> HoodieBloomIndex with HeapState, RocksDBState and FsState, but on
> > > > >> Spark.
> > > > >>
> > > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <
> [email protected]>
> >
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Vinoth,
> > > > >> > Having seen the doc and code, I understand the
> > > > >> > HoodieBloomIndex mainly caches key and partition path. Can we
> > > > >> > address how Flink does it? Like, have HeapState where the user
> > > > >> > chooses to cache the index on heap, RocksDBState where indexes are
> > > > >> > written to RocksDB, and finally FsState where indexes can be
> > > > >> > written to HDFS, S3, or Azure FS. And on top, we can do an index
> > > > >> > Time To Live.
> > > > >> >
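> > > > >> > In Flink terms, those three knobs are roughly a state-backend
> > > > >> > choice plus StateTtlConfig; a sketch of the wiring (the path and
> > > > >> > TTL values are made up):
> > > > >> >
> > > > >> > import org.apache.flink.api.common.state.StateTtlConfig;
> > > > >> > import org.apache.flink.api.common.state.ValueStateDescriptor;
> > > > >> > import org.apache.flink.api.common.time.Time;
> > > > >> > import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
> > > > >> > import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> > > > >> >
> > > > >> > public class IndexStateConfigSketch {
> > > > >> >     public static void main(String[] args) throws Exception {
> > > > >> >         StreamExecutionEnvironment env =
> > > > >> >                 StreamExecutionEnvironment.getExecutionEnvironment();
> > > > >> >         // RocksDB keeps the index off-heap and spills to local disk;
> > > > >> >         // checkpoints go to a DFS path (HDFS/S3/...), which covers
> > > > >> >         // the "FsState" flavor.
> > > > >> >         env.setStateBackend(new RocksDBStateBackend("hdfs:///hudi/index-ckpts"));
> > > > >> >
> > > > >> >         StateTtlConfig ttl = StateTtlConfig
> > > > >> >                 .newBuilder(Time.days(7)) // index entries expire after 7 days
> > > > >> >                 .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
> > > > >> >                 .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
> > > > >> >                 .build();
> > > > >> >         ValueStateDescriptor<String> desc =
> > > > >> >                 new ValueStateDescriptor<>("record-key-to-file-id", String.class);
> > > > >> >         desc.enableTimeToLive(ttl); // attach the TTL to the index state
> > > > >> >     }
> > > > >> > }
> > > > >> >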
> > > > >> > Regards,
> > > > >> > Taher Koitawala
> > > > >> >
> > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <
> > [email protected]>
> > > > >> wrote:
> > > > >> >
> > > > >> >> I still feel the key thing here is reimplementing
> > > > >> >> HoodieBloomIndex without needing Spark caching.
> > > > >> >>
> > > > >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
> > > > >> >> documents the Spark DAG in detail.
> > > > >> >>
> > > > >> >> If everyone feels it's best for me to scope the work out, then
> > > > >> >> I'm happy to do it!
> > > > >> >>
> > > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <
> > > [email protected]
> > > > >
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >> > Guys, I think we are slowing down on this again. We need to
> > > > >> >> > start planning small tasks towards this. VC, can you please
> > > > >> >> > help fast-track this?
> > > > >> >> >
> > > > >> >> > Regards,
> > > > >> >> > Taher Koitawala
> > > > >> >> >
> > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <
> > [email protected]
> > > >
> > > > >> >> wrote:
> > > > >> >> >
> > > > >> >> > > Look forward to the analysis. A key class to read would be
> > > > >> >> > > HoodieBloomIndex, which uses a lot of Spark caching and
> > > > >> >> > > shuffles.
> > > > >> >> > >
> > > > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <
> > > [email protected]
> > > > >
> > > > >> >> wrote:
> > > > >> >> > >
> > > > >> >> > > > >> Currently Spark Streaming micro batching fits well with
> > > > >> >> > > > >> Hudi, since it amortizes the cost of indexing, workload
> > > > >> >> > > > >> profiling etc. 1 spark micro batch = 1 hudi commit
> > > > >> >> > > > >> With the per-record model in Flink, I am not sure how
> > > > >> >> > > > >> useful it will be to support hudi.. for e.g, 1 input
> > > > >> >> > > > >> record cannot be 1 hudi commit, it will be inefficient..
> > > > >> >> > > >
> > > > >> >> > > > Yes, if 1 input record = 1 hudi commit, it would be
> > > > >> >> > > > inefficient. About Flink streaming, we can also implement
> > > > >> >> > > > the "batch" and "micro-batch" model when processing data.
> > > > >> >> > > > For example:
> > > > >> >> > > >
> > > > >> >> > > > - aggregation: use the flexible window mechanism;
> > > > >> >> > > > - non-aggregation: use the Flink stateful API to cache a
> > > > >> >> > > > batch of data
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > >> On first focussing on decoupling of Spark and Hudi
> > > > >> >> > > > >> alone, yes a full summary of how Spark is being used in
> > > > >> >> > > > >> a wiki page is a good start IMO. We can then hash out
> > > > >> >> > > > >> what can be generalized and what cannot be and needs to
> > > > >> >> > > > >> be left in hudi-client-spark vs hudi-client-core
> > > > >> >> > > >
> > > > >> >> > > > agree
> > > > >> >> > > >
> > > > >> >> > > > Vinoth Chandar <[email protected]> wrote on Wed, Aug 14,
> > > > >> >> > > > 2019 at 8:35 AM:
> > > > >> >> > > >
> > > > >> >> > > > > >> We should only stick to Flink Streaming. Furthermore,
> > > > >> >> > > > > >> if there is a requirement for batch, then users should
> > > > >> >> > > > > >> use Spark, or else we will anyway have a Beam
> > > > >> >> > > > > >> integration coming up.
> > > > >> >> > > > >
> > > > >> >> > > > > Currently Spark Streaming micro batching fits well with
> > > > >> >> > > > > Hudi, since it amortizes the cost of indexing, workload
> > > > >> >> > > > > profiling etc. 1 spark micro batch = 1 hudi commit
> > > > >> >> > > > > With the per-record model in Flink, I am not sure how
> > > > >> >> > > > > useful it will be to support hudi.. for e.g, 1 input
> > > > >> >> > > > > record cannot be 1 hudi commit, it will be inefficient..
> > > > >> >> > > > >
> > > > >> >> > > > > On first focussing on decoupling of Spark and Hudi alone,
> > > > >> >> > > > > yes a full summary of how Spark is being used in a wiki
> > > > >> >> > > > > page is a good start IMO. We can then hash out what can
> > > > >> >> > > > > be generalized and what cannot be and needs to be left in
> > > > >> >> > > > > hudi-client-spark vs hudi-client-core
> > > > >> >> > > > >
> > > > >> >> > > > >
> > > > >> >> > > > >
> > > > >> >> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <
> > > > >> [email protected]>
> > > > >> >> > > wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Hi Nick and Taher,
> > > > >> >> > > > > >
> > > > >> >> > > > > > I just want to answer Nishith's question, referencing
> > > > >> >> > > > > > his old description here:
> > > > >> >> > > > > >
> > > > >> >> > > > > > > You can do a parallel investigation while we are
> > > > >> >> > > > > > > deciding on the module structure. You could be
> > > > >> >> > > > > > > looking at all the patterns in Hudi's Spark API usage
> > > > >> >> > > > > > > (RDD/DataSource/SparkContext) and see if such support
> > > > >> >> > > > > > > can be achieved in theory with Flink. If not, what is
> > > > >> >> > > > > > > the workaround? Documenting such patterns would be
> > > > >> >> > > > > > > valuable when multiple engineers are working on it.
> > > > >> >> > > > > > > For e.g., Hudi relies on (a) custom partitioning
> > > > >> >> > > > > > > logic for upserts, (b) caching RDDs to avoid reruns
> > > > >> >> > > > > > > of costly stages, and (c) a Spark upsert task knowing
> > > > >> >> > > > > > > its Spark partition/task/attempt ids.
> > > > >> >> > > > > >
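> > > > >> >> > > > > > A minimal illustration (not Hudi code) of those three
> > > > >> >> > > > > > patterns, i.e. what any replacement engine must offer
> > > > >> >> > > > > > equivalents for:
> > > > >> >> > > > > >
> > > > >> >> > > > > > import org.apache.spark.HashPartitioner;
> > > > >> >> > > > > > import org.apache.spark.TaskContext;
> > > > >> >> > > > > > import org.apache.spark.api.java.JavaPairRDD;
> > > > >> >> > > > > > import org.apache.spark.api.java.JavaSparkContext;
> > > > >> >> > > > > > import org.apache.spark.storage.StorageLevel;
> > > > >> >> > > > > > import scala.Tuple2;
> > > > >> >> > > > > >
> > > > >> >> > > > > > import java.util.Arrays;
> > > > >> >> > > > > >
> > > > >> >> > > > > > public class SparkPatterns {
> > > > >> >> > > > > >     public static void main(String[] args) {
> > > > >> >> > > > > >         JavaSparkContext jsc =
> > > > >> >> > > > > >                 new JavaSparkContext("local[2]", "patterns");
> > > > >> >> > > > > >         JavaPairRDD<String, Integer> pairs = jsc
> > > > >> >> > > > > >                 .parallelize(Arrays.asList("a", "b", "a"))
> > > > >> >> > > > > >                 .mapToPair(k -> new Tuple2<>(k, 1));
> > > > >> >> > > > > >
> > > > >> >> > > > > >         // (a) custom partitioning logic for upserts
> > > > >> >> > > > > >         JavaPairRDD<String, Integer> parted =
> > > > >> >> > > > > >                 pairs.partitionBy(new HashPartitioner(4));
> > > > >> >> > > > > >
> > > > >> >> > > > > >         // (b) caching to avoid reruns of costly stages
> > > > >> >> > > > > >         parted.persist(StorageLevel.MEMORY_AND_DISK());
> > > > >> >> > > > > >
> > > > >> >> > > > > >         // (c) a task knowing its partition/task/attempt ids
> > > > >> >> > > > > >         parted.foreachPartition(it -> {
> > > > >> >> > > > > >             TaskContext tc = TaskContext.get();
> > > > >> >> > > > > >             System.out.printf("partition=%d task=%d attempt=%d%n",
> > > > >> >> > > > > >                     tc.partitionId(), tc.taskAttemptId(),
> > > > >> >> > > > > >                     tc.attemptNumber());
> > > > >> >> > > > > >         });
> > > > >> >> > > > > >         jsc.stop();
> > > > >> >> > > > > >     }
> > > > >> >> > > > > > }
> > > > >> >> > > > > >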
> > > > >> >> > > > > > And just like the title of this thread says, we are
> > > > >> >> > > > > > going to try to decouple Hudi and Spark. That means we
> > > > >> >> > > > > > can run the whole of Hudi without depending on Spark.
> > > > >> >> > > > > > So we need to analyze all the usage of Spark in Hudi.
> > > > >> >> > > > > >
> > > > >> >> > > > > > Here we are not discussing the integration of Hudi and
> > > > >> >> > > > > > Flink at the application layer. Instead, I want Hudi to
> > > > >> >> > > > > > be decoupled from Spark, allowing other engines (such
> > > > >> >> > > > > > as Flink) to replace Spark.
> > > > >> >> > > > > >
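> > > > >> >> > > > > > One possible shape of that decoupling (all names here
> > > > >> >> > > > > > are invented for discussion; nothing like this exists
> > > > >> >> > > > > > in Hudi yet): the core talks to a tiny engine
> > > > >> >> > > > > > interface, and hoodie-client-spark /
> > > > >> >> > > > > > hoodie-client-flink implement it.
> > > > >> >> > > > > >
> > > > >> >> > > > > > import java.util.List;
> > > > >> >> > > > > > import java.util.function.Function;
> > > > >> >> > > > > >
> > > > >> >> > > > > > public interface HoodieEngineContext {
> > > > >> >> > > > > >     // parallel transform, backed by RDDs or DataStreams
> > > > >> >> > > > > >     <I, O> List<O> map(List<I> data, Function<I, O> fn, int parallelism);
> > > > >> >> > > > > >     // cache hook; a no-op for engines without caching
> > > > >> >> > > > > >     <I> void persist(List<I> data);
> > > > >> >> > > > > > }
> > > > >> >> > > > > >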
> > > > >> >> > > > > > It can be divided into long-term goals and short-term
> > > > >> >> > > > > > goals, as Nishith stated in a recent email.
> > > > >> >> > > > > >
> > > > >> >> > > > > > I mentioned the Flink Batch API here because Hudi can
> > > > >> >> > > > > > connect with many different Sources/Sinks. Some
> > > > >> >> > > > > > file-based reads are not appropriate for Flink
> > > > >> >> > > > > > Streaming.
> > > > >> >> > > > > >
> > > > >> >> > > > > > Therefore, this is a comprehensive survey of the use of
> > > > >> >> > > > > > Spark in Hudi.
> > > > >> >> > > > > >
> > > > >> >> > > > > > Best,
> > > > >> >> > > > > > Vino
> > > > >> >> > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > > taher koitawala <[email protected]> wrote on Tue,
> > > > >> >> > > > > > Aug 13, 2019 at 5:43 PM:
> > > > >> >> > > > > >
> > > > >> >> > > > > > > Hi Vino,
> > > > >> >> > > > > > > According to what I've seen, Hudi has a lot of Spark
> > > > >> >> > > > > > > components flowing through it, like TaskContexts,
> > > > >> >> > > > > > > JavaSparkContexts etc. The main classes I guess we
> > > > >> >> > > > > > > should focus upon are HoodieTable and the Hoodie
> > > > >> >> > > > > > > write clients.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Also, Vino, I don't think we should be providing a
> > > > >> >> > > > > > > Flink DataSet implementation. We should only stick to
> > > > >> >> > > > > > > Flink Streaming. Furthermore, if there is a
> > > > >> >> > > > > > > requirement for batch, then users should use Spark,
> > > > >> >> > > > > > > or else we will anyway have a Beam integration coming
> > > > >> >> > > > > > > up.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > As for the cache, how about we write our own
> > > > >> >> > > > > > > stateful Flink function and use the
> > > > >> >> > > > > > > RocksDBStateBackend with some state TTL.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang <
> > > > >> >> [email protected]>
> > > > >> >> > > > wrote:
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > Hi all,
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > After doing some research, let me share my
> > > information:
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > - Limitation of computing engine capabilities: Hudi
> > > > >> >> > > > > > > > uses Spark's RDD#persist, and Flink currently has
> > > > >> >> > > > > > > > no API to cache datasets. Maybe we can only choose
> > > > >> >> > > > > > > > to use external storage, or not use a cache? For
> > > > >> >> > > > > > > > the use of other APIs, the two currently offer
> > > > >> >> > > > > > > > almost equivalent capabilities.
> > > > >> >> > > > > > > > - The abstraction of the computing engine is
> > > > >> >> > > > > > > > different: considering the different usage
> > > > >> >> > > > > > > > scenarios of the computing engine in Hudi, Flink
> > > > >> >> > > > > > > > has not yet implemented stream/batch unification,
> > > > >> >> > > > > > > > so we may use both Flink's DataSet API (batch
> > > > >> >> > > > > > > > processing) and DataStream API (stream processing).
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Best,
> > > > >> >> > > > > > > > Vino
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > nishith agarwal <[email protected]> wrote on
> > > > >> >> > > > > > > > Thu, Aug 8, 2019 at 12:57 AM:
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > > Nick,
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > You bring up a good point about the non-trivial
> > > > >> >> programming
> > > > >> >> > > model
> > > > >> >> > > > > > > > > differences between these different
> > technologies.
> > > > From
> > > > >> a
> > > > >> >> > > > > theoretical
> > > > >> >> > > > > > > > > perspective, I'd say considering a higher level
> > > > >> >> abstraction
> > > > >> >> > > makes
> > > > >> >> > > > > > > sense.
> > > > >> >> > > > > > > > I
> > > > >> >> > > > > > > > > think we have to decouple some objectives and
> > > > concerns
> > > > >> >> here.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > a) The immediate desire is to have Hudi be able
> > to
> > > > run
> > > > >> on
> > > > >> >> a
> > > > >> >> > > Flink
> > > > >> >> > > > > (or
> > > > >> >> > > > > > > > > non-spark) engine. This naturally begs the
> > question
> > > > of
> > > > >> >> > > decoupling
> > > > >> >> > > > > > Hudi
> > > > >> >> > > > > > > > > concepts from direct Spark dependencies.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > b) If we do want to initiate the above effort,
> > > would
> > > > it
> > > > >> >> make
> > > > >> >> > > > sense
> > > > >> >> > > > > to
> > > > >> >> > > > > > > > just
> > > > >> >> > > > > > > > > have a higher level abstraction, building on
> > other
> > > > >> >> > technologies
> > > > >> >> > > > > like
> > > > >> >> > > > > > > beam
> > > > >> >> > > > > > > > > (euphoria etc) and provide single, clean API's
> > that
> > > > >> may be
> > > > >> >> > more
> > > > >> >> > > > > > > > > maintainable from a code perspective. But at
> the
> > > same
> > > > >> time
> > > > >> >> > this
> > > > >> >> > > > > will
> > > > >> >> > > > > > > > > introduce challenges on how to maintain
> > efficiency
> > > > and
> > > > >> >> > > optimized
> > > > >> >> > > > > > > runtime
> > > > >> >> > > > > > > > > dags for Hudi (since the code would move away
> > from
> > > > >> point
> > > > >> >> > > > > integrations
> > > > >> >> > > > > > > and
> > > > >> >> > > > > > > > > whenever this happens, tuning natively for
> > specific
> > > > >> >> engines
> > > > >> >> > > > becomes
> > > > >> >> > > > > > > more
> > > > >> >> > > > > > > > > and more difficult).
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > My general opinion is that, as the community
> > grows
> > > > over
> > > > >> >> time
> > > > >> >> > > with
> > > > >> >> > > > > > more
> > > > >> >> > > > > > > > > folks having an in-depth understanding of Hudi,
> > > going
> > > > >> from
> > > > >> >> > > > > > > current_state
> > > > >> >> > > > > > > > ->
> > > > >> >> > > > > > > > > (a) -> (b) might be the most reliable and
> > adoptable
> > > > >> path
> > > > >> >> for
> > > > >> >> > > this
> > > > >> >> > > > > > > > project.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > Thanks,
> > > > >> >> > > > > > > > > Nishith
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <
> > > > >> >> > > > > > [email protected]>
> > > > >> >> > > > > > > > > wrote:
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > > There are some non-trivial differences between
> > > > >> >> > > > > > > > > > the programming models and runtime semantics of
> > > > >> >> > > > > > > > > > Beam, Spark and Flink.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Nishith, Vino - thoughts?
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Does it make sense to consider a higher-level
> > > > >> >> > > > > > > > > > abstraction / DSL instead of maintaining
> > > > >> >> > > > > > > > > > different code with the same functionality but
> > > > >> >> > > > > > > > > > different programming models?
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> https://beam.apache.org/documentation/sdks/java/euphoria/
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Nick
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal
> <
> > > > >> >> > > > > [email protected]>
> > > > >> >> > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > +1 for Approach 1 Point integration with each
> > > > >> framework.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Pros for point integration
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Hudi community is already familiar with
> > spark
> > > > >> and
> > > > >> >> > spark
> > > > >> >> > > > > based
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > actions/shuffles etc. Since both modules can
> > be
> > > > >> >> decoupled,
> > > > >> >> > > this
> > > > >> >> > > > > > > enables
> > > > >> >> > > > > > > > > us
> > > > >> >> > > > > > > > > > to have a steady release for Hudi for 1
> > execution
> > > > >> engine
> > > > >> >> > > > (spark)
> > > > >> >> > > > > > > while
> > > > >> >> > > > > > > > we
> > > > >> >> > > > > > > > > > hone our skills and iterate on making flink
> > dag
> > > > >> >> optimized,
> > > > >> >> > > > > > performant
> > > > >> >> > > > > > > > > with
> > > > >> >> > > > > > > > > > the right configuration.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - This might be a stepping stone towards
> > > rewriting
> > > > >> >> the
> > > > >> >> > > > entire
> > > > >> >> > > > > > code
> > > > >> >> > > > > > > > > base
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > being agnostic of spark/flink. This approach
> > will
> > > > >> help
> > > > >> >> us
> > > > >> >> > fix
> > > > >> >> > > > > > tests,
> > > > >> >> > > > > > > > > > intricacies and help make the code base ready
> > > for a
> > > > >> >> larger
> > > > >> >> > > > > rework.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Seems like the easiest way to add flink
> > support
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Cons
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - More code paths to maintain and reason
> since
> > > the
> > > > >> >> spark
> > > > >> >> > > and
> > > > >> >> > > > > > flink
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > integrations will naturally diverge over
> time.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Theoretically, I do like the idea of being
> > able
> > > to
> > > > >> run
> > > > >> >> the
> > > > >> >> > > hudi
> > > > >> >> > > > > dag
> > > > >> >> > > > > > > on
> > > > >> >> > > > > > > > > beam
> > > > >> >> > > > > > > > > > more than point integrations, where there is
> > one
> > > > >> >> API/logic
> > > > >> >> > to
> > > > >> >> > > > > > reason
> > > > >> >> > > > > > > > > about.
> > > > >> >> > > > > > > > > > But practically, that may not be the right
> > > > direction.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Pros
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Lesser cognitive burden in maintaining,
> > > evolving
> > > > >> >> and
> > > > >> >> > > > > releasing
> > > > >> >> > > > > > > the
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > project with one API to reason with.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Theoretically, going forward assuming beam
> > is
> > > > >> >> adopted
> > > > >> >> > > as a
> > > > >> >> > > > > > > > standard
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > programming paradigm for stream/batch, this
> > would
> > > > >> enable
> > > > >> >> > > > > consumers
> > > > >> >> > > > > > > > > leverage
> > > > >> >> > > > > > > > > > the power of hudi more easily.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Cons
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Massive rewrite of the code base.
> > Additionally,
> > > > >> >> since
> > > > >> >> > we
> > > > >> >> > > > > would
> > > > >> >> > > > > > > > have
> > > > >> >> > > > > > > > > > moved
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > away from directly using spark APIs, there is
> > a
> > > > >> bigger
> > > > >> >> risk
> > > > >> >> > > of
> > > > >> >> > > > > > > > > regression.
> > > > >> >> > > > > > > > > > We would have to be very thorough with all
> the
> > > > >> >> intricacies
> > > > >> >> > > and
> > > > >> >> > > > > > ensure
> > > > >> >> > > > > > > > the
> > > > >> >> > > > > > > > > > same stability of new releases.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Managing future features (which may be very
> > > > >> spark
> > > > >> >> > > driven)
> > > > >> >> > > > > will
> > > > >> >> > > > > > > > > either
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > clash or pause or will need to be reworked.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Tuning jobs for Spark/Flink type execution
> > > > >> >> frameworks
> > > > >> >> > > > > > > individually
> > > > >> >> > > > > > > > > > might
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > be difficult and will get difficult over time
> > as
> > > > the
> > > > >> >> > project
> > > > >> >> > > > > > evolves,
> > > > >> >> > > > > > > > > where
> > > > >> >> > > > > > > > > > some beam integrations with spark/flink may
> > not
> > > > work
> > > > >> as
> > > > >> >> > > > expected.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > - Also, as pointed above, need to probably
> > > support
> > > > >> >> the
> > > > >> >> > > > > > > hoodie-spark
> > > > >> >> > > > > > > > > > module
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > as a first-class.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Thanks,
> > > > >> >> > > > > > > > > > Nishith
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher
> koitawala
> > <
> > > > >> >> > > > > [email protected]
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Hi Vinoth,
> > > > >> >> > > > > > > > > > Are there some tasks I can take up to ramp up
> > the
> > > > >> code?
> > > > >> >> > Want
> > > > >> >> > > to
> > > > >> >> > > > > get
> > > > >> >> > > > > > > > > > more used to the code and understand the
> > existing
> > > > >> >> > > > implementation
> > > > >> >> > > > > > > > better.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Thanks,
> > > > >> >> > > > > > > > > > Taher Koitawala
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar
> <
> > > > >> >> > > > [email protected]>
> > > > >> >> > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Let's see if others have any thoughts as
> well.
> > We
> > > > can
> > > > >> >> plan
> > > > >> >> > to
> > > > >> >> > > > fix
> > > > >> >> > > > > > the
> > > > >> >> > > > > > > > > > approach by EOW.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <
> > > > >> >> > > > [email protected]>
> > > > >> >> > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Hi guys,
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Also, +1 for Approach 1 like Taher.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of this
> > > model
> > > > >> and
> > > > >> >> > come
> > > > >> >> > > up
> > > > >> >> > > > > > with.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > means
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > to refactor this cleanly, this would be
> > > promising.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Yes, when we get the conclusion, we could
> > start
> > > > this
> > > > >> >> work.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Best,
> > > > >> >> > > > > > > > > > Vino
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > taher koitawala <[email protected]> wrote on
> > > > >> >> > > > > > > > > > Tue, Aug 6, 2019 at 12:28 AM:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > +1 for Approach 1, point integration with each
> > > > >> >> > > > > > > > > > framework
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Approach 2 has a problem as you said
> > "Developers
> > > > >> need to
> > > > >> >> > > think
> > > > >> >> > > > > > about
> > > > >> >> > > > > > > > > >
> > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > >> So in
> > > > >> >> > the
> > > > >> >> > > > end,
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > this
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > may
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > not be the panacea that it seems to be"
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > We have seen various pipelines in the beam
> dag
> > > > being
> > > > >> >> > > expressed
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > differently
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > then we had them in our original usecase. And
> > > also
> > > > >> >> > switching
> > > > >> >> > > > > > between
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > spark
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > and Flink runners in beam have various impact
> > on
> > > > the
> > > > >> >> > > pipelines
> > > > >> >> > > > > like
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > some
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > features available in Flink are not available
> > on
> > > > the
> > > > >> >> spark
> > > > >> >> > > > runner
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > etc.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Refer to this compatibility matrix ->
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Hence my vote on Approch 1 let's decouple and
> > > build
> > > > >> the
> > > > >> >> > > > abstract
> > > > >> >> > > > > > for
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > each
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > framework. That is a much better option. We
> > will
> > > > also
> > > > >> >> have
> > > > >> >> > > more
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > control
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > over each framework's implement.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <
> > > > >> >> > > [email protected]
> > > > >> >> > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Would like to highlight that there are two
> > > distinct
> > > > >> >> > > approaches
> > > > >> >> > > > > here
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > with
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > different tradeoffs. Think of this as my
> > > braindump,
> > > > >> as I
> > > > >> >> > have
> > > > >> >> > > > > been
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > thinking
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > about this quite a bit in the past.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > *Approach 1 : Point integration with each
> > > > framework *
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > We may need a pure client module named for
> > > example
> > > > >> >> > > > > > > > > > hoodie-client-core(common)
> > > > >> >> > > > > > > > > > >> Then we could have: hoodie-client-spark,
> > > > >> >> > > hoodie-client-flink
> > > > >> >> > > > > > > > > > and hoodie-client-beam
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > (+) This is the safest to do IMO, since we
> can
> > > > >> isolate
> > > > >> >> the
> > > > >> >> > > > > current
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Spark
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > execution (hoodie-spark, hoodie-client-spark)
> > > from
> > > > >> the
> > > > >> >> > > changes
> > > > >> >> > > > > for
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > flink,
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > while it stabilizes over few releases.
> > > > >> >> > > > > > > > > > (-) Downside is that the utilities needs to
> be
> > > > >> redone :
> > > > >> >> > > > > > > > > > hoodie-utilities-spark and
> > hoodie-utilities-flink
> > > > and
> > > > >> >> > > > > > > > > > hoodie-utilities-core ? hoodie-cli?
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of this
> > > model
> > > > >> and
> > > > >> >> > come
> > > > >> >> > > up
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > with.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > means
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > to refactor this cleanly, this would be
> > > promising.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > *Approach 2: Beam as the compute abstraction*
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Another more drastic approach is to remove
> > Spark
> > > as
> > > > >> the
> > > > >> >> > > compute
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > abstraction
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > for writing data and replace it with Beam.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > (+) All of the code remains more or less
> > similar
> > > > and
> > > > >> >> there
> > > > >> >> > is
> > > > >> >> > > > one
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > compute
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > API to reason about.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > (-) The (very big) assumption here is that we
> > are
> > > > >> able
> > > > >> >> to
> > > > >> >> > > tune
> > > > >> >> > > > > the
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > spark
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > runtime the same way using Beam : custom
> > > > >> partitioners,
> > > > >> >> > > support
> > > > >> >> > > > > for
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > all
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > RDD
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > operations we invoke, caching etc etc.
> > > > >> >> > > > > > > > > > (-) It will be a massive rewrite and testing
> > of
> > > > such
> > > > >> a
> > > > >> >> > large
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > rewrite
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > would
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > also be really challenging, since we need to
> > pay
> > > > >> >> attention
> > > > >> >> > to
> > > > >> >> > > > all
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > intricate
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > details to ensure the spark users today
> > > experience
> > > > no
> > > > >> >> > > > > > > > > > regressions/side-effects
> > > > >> >> > > > > > > > > > (-) Note that we still need to probably
> > support
> > > the
> > > > >> >> > > > hoodie-spark
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > module
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > and
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > may be a first-class such integration with
> > flink,
> > > > for
> > > > >> >> > native
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > flink/spark
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > pipeline authoring. Users of say
> DeltaStreamer
> > > need
> > > > >> to
> > > > >> >> pass
> > > > >> >> > > in
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Spark
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > or
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Flink configs anyway.. Developers need to
> > think
> > > > about
> > > > >> >> > > > > > > > > >
> > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > >> So in
> > > > >> >> > the
> > > > >> >> > > > end,
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > this
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > may
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > not be the panacea that it seems to be.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > One goal for the HIP is to get us all to
> agree
> > > as a
> > > > >> >> > community
> > > > >> >> > > > > which
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > one
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > to
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > pick, with sufficient investigation, testing,
> > > > >> >> > benchmarking..
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <
> > > > >> >> > > > [email protected]>
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > +1 for both Beam and Flink
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > First step here is to probably draw out
> > current
> > > > >> >> hierrarchy
> > > > >> >> > > and
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > figure
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > out
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > what the abstraction points are..
> > > > >> >> > > > > > > > > > In my opinion, the runtime (spark, flink)
> > should
> > > be
> > > > >> >> done at
> > > > >> >> > > the
> > > > >> >> > > > > > > > > > hoodie-client level and just used by
> > > > hoodie-utilties
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > seamlessly..
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > +1 for Vinoth's opinion, it should be the
> > first
> > > > step.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > No matter we hope Hudi to integrate with
> which
> > > > >> computing
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > framework.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > We need to decouple Hudi client and Spark.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > We may need a pure client module named for
> > > example
> > > > >> >> > > > > > > > > > hoodie-client-core(common)
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Then we could have: hoodie-client-spark,
> > > > >> >> > hoodie-client-flink
> > > > >> >> > > > and
> > > > >> >> > > > > > > > > > hoodie-client-beam
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Suneel Marthi <[email protected]> wrote on
> > > > >> >> > > > > > > > > > Sun, Aug 4, 2019 at 10:45 AM:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's
> > > > analysis.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher
> > koitawala <
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > [email protected]>
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > So the way to go around this is to file a HIP,
> > > > >> >> > > > > > > > > > chalk out all the classes, and start moving
> > > > >> >> > > > > > > > > > towards a pure client.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Secondly should we want to try beam?
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > I think there is too much going on here and I'm
> > > > >> >> > > > > > > > > > not able to follow. If we want to try out Beam
> > > > >> >> > > > > > > > > > all along, I don't think it makes sense to do
> > > > >> >> > > > > > > > > > anything on Flink then.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > [email protected]>
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >> +1 My money is on this approach.
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> The existing abstractions from Beam seem
> > > enough
> > > > >> for
> > > > >> >> the
> > > > >> >> > > use
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > cases
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > as I
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > imagine them.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >> Flink also has "dynamic table", "table
> > source"
> > > > and
> > > > >> >> > "table
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > sink"
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > which
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > seem very useful abstractions where Hudi
> might
> > > fit
> > > > >> >> nicely.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> Attached a screen shot.
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> This seems to fit with the original
> premise
> > of
> > > > >> Hudi
> > > > >> >> as
> > > > >> >> > > well.
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> Am exploring this venue with a use case
> > that
> > > > >> involves
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > "temporal
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > joins
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > on
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > streams" which I need for feature extraction.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >> Anyone is interested in this or has
> > concrete
> > > > >> enough
> > > > >> >> > needs
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > and
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > use
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > cases
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > please let me know.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > >> Best to go from an agreed upon set of 2-3
> > use
> > > > >> cases.
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> Cheers
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> Nick
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >>
> > > > >> >> > > > > > > > > > >> > Also, we do have some Beam experts on the
> > > > >> >> > > > > > > > > > >> > mailing list.. Can you please weigh in on
> > > > >> >> > > > > > > > > > >> > the viability of using Beam as the
> > > > >> >> > > > > > > > > > >> > intermediate abstraction here between
> > > > >> >> > > > > > > > > > >> > Spark/Flink?
> > > > >> >> > > > > > > > > > Hudi uses RDD APIs like groupBy, mapToPair,
> > > > >> >> > > > > > > > > > sortAndRepartition, reduceByKey, countByKey and
> > > > >> >> > > > > > > > > > also does custom partitioning a lot.
> > > > >> >> > > > > > > > > >