Hi Vinoth,
IMHO we should stick to Spark for micro batching, for 2 reasons. 1:
Ease of use. 2: Performance. Flink batch is not as fast as Spark. Also, the
rich library of functions and the ease of integration that Spark has with
Hive etc. is not there in Flink batch.
Regards,
Taher Koitawala
On Thu, Sep 26, 2019, 10:11 PM Vinoth Chandar <[email protected]> wrote:
> How mature is Flink batch? I am wondering if we can go with the Flink
> Batch APIs. Hudi itself will provide micro-batching semantics (incremental
> pull, upserts)?
> For true streaming performance, I am not sure even cloud stores are ready,
> since none of them support append()s.
> IMHO a mini/micro batch model makes sense for the kind of space Hudi
> solves.. We can keep working towards making this.
>
> I have some thoughts on indexing, a new way to index, that complements
> BloomIndex and overcomes some of the spark caching needs.
> Will write it down over the weekend and share it..
>
> P.S: I am also wondering if we should generalize HIPs into an RFC
> process[1]; that way it does not have to be a real technical proposal and
> we can still have structured conversations on future roadmap etc. and
> evolve the shape together..
> (May be this deserves its own separate DISCUSS thread)
>
>
> [1] - https://en.wikipedia.org/wiki/Request_for_Comments
>
>
> On Wed, Sep 25, 2019 at 9:07 PM Taher Koitawala <[email protected]>
> wrote:
>
> > Hi Vino,
> > Agree with your suggestion. We all know that even though Flink is
> > streaming, we can control how files get rolled out through checkpointing
> > configurations. Bad config, and small files get rolled out. Good config,
> > and files are properly sized.
> >
> > Also, I understand the concern about reading files with Flink and the
> > performance related to it (I have faced it before). So how about we build
> > our own functions which can read and write efficiently and are not a
> > source function but an operator! What I mean is, let's use the Akka-based
> > semantics of passing messages between these operators and read what is
> > required.
> >
> > Have 1 custom stream operator that takes requests for reading files
> > (again, that is not a source function, it is an operator); that operator
> > reads the file and passes it downstream in a parallel manner. (Maybe the
> > AsyncIO extension can be a better call here). Let me know your thoughts on
> > this, i.e., if we choose this, everything will be written on the core
> > DataStream API. A rough sketch of the idea is below.
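> >
> > To make this concrete, a minimal, untested Java sketch of such an
> > operator using Flink's AsyncIO (class names and wiring are illustrative,
> > not from any existing codebase; the local-file read is a stand-in for a
> > real HDFS/S3 client):
> >
> > import org.apache.flink.streaming.api.datastream.AsyncDataStream;
> > import org.apache.flink.streaming.api.datastream.DataStream;
> > import org.apache.flink.streaming.api.functions.async.ResultFuture;
> > import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
> > import java.util.concurrent.CompletableFuture;
> > import java.util.concurrent.TimeUnit;
> >
> > // An AsyncIO operator (not a source) that receives file-read requests
> > // from upstream and emits the records it reads, in parallel.
> > public class AsyncFileReader extends RichAsyncFunction<String, String> {
> >   @Override
> >   public void asyncInvoke(String filePath, ResultFuture<String> resultFuture) {
> >     CompletableFuture.supplyAsync(() -> {
> >       try {
> >         // Local read as a stand-in; a real impl would use an HDFS/S3 client.
> >         return java.nio.file.Files.readAllLines(java.nio.file.Paths.get(filePath));
> >       } catch (Exception e) {
> >         throw new RuntimeException(e);
> >       }
> >     }).thenAccept(resultFuture::complete);
> >   }
> > }
> >
> > // Wiring into a DataStream pipeline:
> > // DataStream<String> requests = ...; // file paths sent by an upstream operator
> > // DataStream<String> records = AsyncDataStream.unorderedWait(
> > //     requests, new AsyncFileReader(), 60, TimeUnit.SECONDS, 100);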
> >
> > As far as the Flink batch and stream talk goes, I guess as a community we
> > have already agreed that for batch the Spark engine is good enough and
> > streams will be powered by Flink, whereas Beam will do both. By the time
> > Flink's unification of batch and stream is complete, our code will be
> > batch compatible with minimal changes.
> >
> > Further, we need to plan which part of the Flink API we will use: should
> > we stick to the core DataStream API, or should we merge it with the Table
> > API (I think we should) so that we get the append sink, retract sink,
> > temporal tables, and the capability to use the new Blink table planner?
> > It would also, to some extent, lighten our work of reading files, since
> > the Table API I believe is good at doing those things and comes with
> > built-in readers and writers. So basically, if we use that, we could also
> > do an in-stream join between the stream source and the Hudi files to
> > recompute upserts etc. AFAIK the Flink Table API is also getting a
> > .cache()-like functionality now (saw conversations about it on the
> > mailing list). Something along the lines of the sketch below.
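> >
> > For illustration, a rough, untested sketch of the Table API setup being
> > suggested (table names are hypothetical; registering "incoming" and
> > "hudi_table" with the catalog is assumed to happen elsewhere):
> >
> > import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> > import org.apache.flink.table.api.EnvironmentSettings;
> > import org.apache.flink.table.api.Table;
> > import org.apache.flink.table.api.java.StreamTableEnvironment;
> >
> > StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
> > // Use the new Blink planner in streaming mode.
> > EnvironmentSettings settings = EnvironmentSettings.newInstance()
> >     .useBlinkPlanner()
> >     .inStreamingMode()
> >     .build();
> > StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
> >
> > // Join the stream source against the Hudi-backed table to recompute upserts.
> > Table upserts = tEnv.sqlQuery(
> >     "SELECT i.key, i.val FROM incoming AS i JOIN hudi_table AS h ON i.key = h.key");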
> >
> > So I think we really need to start planning such things to move ahead.
> > Other than that, I'm 100% with Vino that we cannot read files otherwise
> > in Flink; it would be really bad to do that.
> >
> > Regards,
> > Taher Koitawala
> >
> > On Thu, Sep 26, 2019, 8:47 AM vino yang <[email protected]> wrote:
> >
> > > Hi
> > >
> > > A simple example: in the Hudi project, you can find many code snippets
> > > like `spark.read().format().load()`. The load method can take any path,
> > > especially DFS paths.
> > >
> > > Whereas if we only want to use Flink streaming, there is no good way to
> > > read from HDFS now.
> > >
> > > In addition, we also need to consider other capability differences
> > > between Flink and Spark. You should know the Spark API
> > > (non-Structured-Streaming mode) can support both streaming (micro-batch)
> > > and batch. However, Flink distinguishes them with two different APIs,
> > > which have different feature sets.
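> > >
> > > For contrast, a rough, untested sketch of the two read patterns being
> > > compared (paths are illustrative):
> > >
> > > import org.apache.flink.api.java.io.TextInputFormat;
> > > import org.apache.flink.core.fs.Path;
> > > import org.apache.flink.streaming.api.datastream.DataStream;
> > > import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> > > import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
> > > import org.apache.spark.sql.Dataset;
> > > import org.apache.spark.sql.Row;
> > >
> > > // Spark, as used throughout Hudi today: batch-read any DFS path,
> > > // given an existing SparkSession `spark`.
> > > Dataset<Row> rows = spark.read().format("parquet").load("hdfs:///path/to/table");
> > >
> > > // Flink streaming: files are consumed as a (monitored) stream source,
> > > // not re-read at will in the middle of a pipeline.
> > > StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
> > > DataStream<String> lines = env.readFile(
> > >     new TextInputFormat(new Path("hdfs:///path/to/table")),
> > >     "hdfs:///path/to/table",
> > >     FileProcessingMode.PROCESS_CONTINUOUSLY,
> > >     10000L);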
> > >
> > >
> > >
> > > On 09/25/2019 13:15, Semantic Beeng <[email protected]> wrote:
> > >
> > > Hi Vino,
> > >
> > > Would you be so kind as to start a wiki page to discuss this deep
> > > understanding of the functionality and design of Hudi?
> > >
> > > There you can put git links (https://github.com/ben-gibson/GitLink for
> > > intellij) and design knowledge so we can discuss in context.
> > >
> > > I am exploring the approach from this retweet
> > > https://twitter.com/semanticbeeng/status/1176241250967666689?s=20 and
> > > need this understanding you have.
> > >
> > > "difficult to ignore Flink Batch API to match some features provide by
> > > Hudi now" - can you please post there some gitlinks to this?
> > >
> > > Thanks
> > >
> > > Nick
> > >
> > >
> > >
> > >
> > >
> > >
> > > On September 24, 2019 at 10:22 PM vino yang <[email protected]>
> > > wrote:
> > >
> > > Hi Taher,
> > >
> > > As I mentioned in the previous mail, things may not be too easy by just
> > > using the Flink state API.
> > >
> > > Copied here " Hudi can connect with many different Source/Sinks. Some
> > > file-based reads are not appropriate for Flink Streaming."
> > >
> > > Although unifying batch and streaming is Flink's goal, it is difficult
> > > to ignore the Flink Batch API to match some features provided by Hudi
> > > now.
> > >
> > > The example you provided is about usage at the application layer. So my
> > > suggestion is to be patient; it needs time to produce a detailed design.
> > >
> > > Best,
> > > Vino
> > >
> > >
> > >
> > > On 09/24/2019 22:38, Taher Koitawala <[email protected]> wrote:
> > > Hi All,
> > > Sample code to see how record tagging will be handled in
> > > Flink is posted at [1]. The main class to run the same is MockHudi.java
> > > with a sample path for checkpointing.
> > >
> > > As of now this is just a sample to show we should be caching in Flink
> > > state with bare-minimum configs. A minimal sketch of the core idea
> > > follows below.
> > >
> > >
> > > In my experience, I have cached around 10s of TBs in Flink RocksDB
> > > state with the right configs. So I'm sure it should work here as well.
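> > >
> > > For the archive, a minimal, untested sketch of what the state-based
> > > tagging boils down to (types simplified to Strings; this is not the
> > > actual code from [1]):
> > >
> > > import org.apache.flink.api.common.state.ValueState;
> > > import org.apache.flink.api.common.state.ValueStateDescriptor;
> > > import org.apache.flink.configuration.Configuration;
> > > import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
> > > import org.apache.flink.util.Collector;
> > >
> > > // Stream is keyed by record key; keyed state remembers where the key
> > > // was last written, mimicking HoodieBloomIndex's tagLocation().
> > > public class TagLocationFunction
> > >     extends KeyedProcessFunction<String, String, String> {
> > >
> > >   private transient ValueState<String> location; // key -> "partition/fileId"
> > >
> > >   @Override
> > >   public void open(Configuration parameters) {
> > >     location = getRuntimeContext().getState(
> > >         new ValueStateDescriptor<>("record-location", String.class));
> > >   }
> > >
> > >   @Override
> > >   public void processElement(String record, Context ctx, Collector<String> out)
> > >       throws Exception {
> > >     String known = location.value();
> > >     if (known == null) {
> > >       out.collect("INSERT|" + record); // never seen: an insert
> > >     } else {
> > >       out.collect("UPDATE|" + known + "|" + record); // tag with known location
> > >     }
> > >     // The write path would call location.update(...) once the record
> > >     // lands in a file.
> > >   }
> > > }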
> > >
> > > 1:
> > >
> > >
> >
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > >
> > > On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <[email protected]>
> > > wrote:
> > >
> > > > It won't be much different than the HBaseIndex we have today. I would
> > > > like to always have an option like BloomIndex that does not need any
> > > > external dependencies.
> > > > The moment you bring an external data store in, someone becomes a DBA.
> > > > :)
> > > >
> > > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <[email protected]>
> > > > wrote:
> > > >
> > > > > @vc can you see how Apache Crail could be used to implement this at
> > > > > scale, but also in a way that abstracts over both Spark and Flink?
> > > > >
> > > > > "Crail Store implements a hierarchical namespace across a cluster of
> > > > > RDMA interconnected storage resources such as DRAM or flash"
> > > > >
> > > > > https://crail.incubator.apache.org/overview/
> > > > >
> > > > > + 2 cents
> > > > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> > > > >
> > > > > Cheers
> > > > >
> > > > > Nick
> > > > >
> > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar <[email protected]>
> > > > > wrote:
> > > > >
> > > > >
> > > > > It could be much larger. :) Imagine billions of keys, each 32 bytes,
> > > > > mapped to another 32 bytes.
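> > > > > (Back-of-the-envelope: 1 billion keys x (32 + 32) bytes is already
> > > > > ~64 GB of raw key/value data, before any serialization or hash-table
> > > > > overhead, so it can easily blow past a 10-15 GB state budget.)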
> > > > >
> > > > > The advantage of the current bloom index is that it's effectively
> > > > > stored with the data itself, and this reduces complexity in terms of
> > > > > keeping index and data consistent etc.
> > > > >
> > > > > One orthogonal idea from a long time ago that moves indexing out of
> > > > > data storage and is generalizable:
> > > > >
> > > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> > > > >
> > > > > If someone here knows Flink well and can implement some standalone
> > > > > Flink code to mimic the tagLocation() functionality and share it with
> > > > > the group, that would be great. Let's worry about performance once we
> > > > > have a Flink DAG. I think this is the most critical and tricky piece
> > > > > in supporting Flink.
> > > > >
> > > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <[email protected]>
> > > > > wrote:
> > > > >
> > > > > Hi Taher,
> > > > >
> > > > > I agree with this; if the state is becoming too large, we should
> > > > > have an option of storing it in external state like a file system or
> > > > > RocksDB.
> > > > >
> > > > > @Vinoth Chandar <[email protected]> can the state of HoodieBloomIndex
> > > > > go beyond 10-15 GB?
> > > > >
> > > > > Regards,
> > > > > Vinay Patil
> > > > >
> > > > > >
> > > > >
> > > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <[email protected]>
> > > > > wrote:
> > > > >
> > > > > >> Hey guys, any thoughts on the above idea? To handle
> > > > > >> HoodieBloomIndex with HeapState, RocksDBState and FsState, but on
> > > > > >> Spark.
> > > > > >>
> > > > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <[email protected]>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Vinoth,
> > > > > >> > Having seen the doc and code, I understand the
> > > > > >> > HoodieBloomIndex mainly caches the key and partition path. Can
> > > > > >> > we address how Flink does it? Like, have HeapState where the
> > > > > >> > user chooses to cache the index on heap, RocksDBState where
> > > > > >> > indexes are written to RocksDB, and finally FsState where
> > > > > >> > indexes can be written to HDFS, S3, or Azure FS. And on top, we
> > > > > >> > can do an index Time To Live, roughly like the sketch below.
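> > > > > >> >
> > > > > >> > A minimal, untested sketch of the TTL part, using Flink's state
> > > > > >> > TTL support (names are illustrative):
> > > > > >> >
> > > > > >> > import org.apache.flink.api.common.state.StateTtlConfig;
> > > > > >> > import org.apache.flink.api.common.state.ValueStateDescriptor;
> > > > > >> > import org.apache.flink.api.common.time.Time;
> > > > > >> >
> > > > > >> > // Attach a time-to-live to the index state so stale
> > > > > >> > // key -> partition-path entries expire instead of growing
> > > > > >> > // forever. The heap vs RocksDB vs FS choice is made separately
> > > > > >> > // via the job's state backend configuration.
> > > > > >> > StateTtlConfig ttl = StateTtlConfig
> > > > > >> >     .newBuilder(Time.days(7))
> > > > > >> >     .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
> > > > > >> >     .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
> > > > > >> >     .build();
> > > > > >> >
> > > > > >> > ValueStateDescriptor<String> indexDesc =
> > > > > >> >     new ValueStateDescriptor<>("record-location", String.class);
> > > > > >> > indexDesc.enableTimeToLive(ttl);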
> > > > > >> >
> > > > > >> > Regards,
> > > > > >> > Taher Koitawala
> > > > > >> >
> > > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <[email protected]>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> I still feel the key thing here is reimplementing
> > > > > >> >> HoodieBloomIndex without needing Spark caching.
> > > > > >> >>
> > > > > >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
> > > > > >> >> documents the Spark DAG in detail.
> > > > > >> >>
> > > > > >> >> If everyone feels it's best for me to scope the work out, then
> > > > > >> >> happy to do it!
> > > > > >> >>
> > > > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <[email protected]>
> > > > > >> >> wrote:
> > > > > >> >>
> > > > > >> >> > Guys, I think we are slowing down on this again. We need to
> > > > > >> >> > start planning small tasks towards this. VC, please can you
> > > > > >> >> > help fast-track this?
> > > > > >> >> >
> > > > > >> >> > Regards,
> > > > > >> >> > Taher Koitawala
> > > > > >> >> >
> > > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <[email protected]>
> > > > > >> >> > wrote:
> > > > > >> >> >
> > > > > >> >> > > Look forward to the analysis. A key class to read would be
> > > > > >> >> > > HoodieBloomIndex, which uses a lot of Spark caching and
> > > > > >> >> > > shuffles.
> > > > > >> >> > >
> > > > > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <[email protected]>
> > > > > >> >> > > wrote:
> > > > > >> >> > >
> > > > > >> >> > > > >> Currently Spark Streaming micro batching fits well
> > > > > >> >> > > > >> with Hudi, since it amortizes the cost of indexing,
> > > > > >> >> > > > >> workload profiling etc. 1 Spark micro batch = 1 Hudi
> > > > > >> >> > > > >> commit.
> > > > > >> >> > > > >> With the per-record model in Flink, I am not sure how
> > > > > >> >> > > > >> useful it will be to support Hudi.. for e.g., 1 input
> > > > > >> >> > > > >> record cannot be 1 Hudi commit, it will be
> > > > > >> >> > > > >> inefficient..
> > > > > >> >> > > >
> > > > > >> >> > > > Yes, if 1 input record = 1 Hudi commit, it would be
> > > > > >> >> > > > inefficient. With Flink streaming, we can also implement
> > > > > >> >> > > > the "batch" and "micro-batch" model when processing data.
> > > > > >> >> > > > For example:
> > > > > >> >> > > >
> > > > > >> >> > > > - aggregation: use the flexible window mechanism;
> > > > > >> >> > > > - non-aggregation: use the Flink stateful state API to
> > > > > >> >> > > >   cache a batch of data (see the sketch below)
> > > > > >> >> > > >
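> > > > > >> >> > > > An untested sketch of the micro-batch idea: collect
> > > > > >> >> > > > records into tumbling windows so each fired window can
> > > > > >> >> > > > become one commit-sized batch (all names illustrative):
> > > > > >> >> > > >
> > > > > >> >> > > > import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
> > > > > >> >> > > > import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
> > > > > >> >> > > > import org.apache.flink.util.Collector;
> > > > > >> >> > > > import java.util.ArrayList;
> > > > > >> >> > > > import java.util.List;
> > > > > >> >> > > >
> > > > > >> >> > > > public class MicroBatcher
> > > > > >> >> > > >     extends ProcessWindowFunction<String, List<String>, String, TimeWindow> {
> > > > > >> >> > > >   @Override
> > > > > >> >> > > >   public void process(String key, Context ctx, Iterable<String> records,
> > > > > >> >> > > >                       Collector<List<String>> out) {
> > > > > >> >> > > >     List<String> batch = new ArrayList<>();
> > > > > >> >> > > >     records.forEach(batch::add);
> > > > > >> >> > > >     out.collect(batch); // one fired window == one commit-sized batch
> > > > > >> >> > > >   }
> > > > > >> >> > > > }
> > > > > >> >> > > >
> > > > > >> >> > > > // Wiring (fragment):
> > > > > >> >> > > > // records.keyBy(r -> extractPartition(r))
> > > > > >> >> > > > //     .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
> > > > > >> >> > > > //     .process(new MicroBatcher());
> > > > > >> >> > > >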
> > > > > >> >> > > >
> > > > > >> >> > > > >> On first focussing on decoupling of Spark and Hudi
> > > > > >> >> > > > >> alone, yes, a full summary of how Spark is being used,
> > > > > >> >> > > > >> in a wiki page, is a good start IMO. We can then hash
> > > > > >> >> > > > >> out what can be generalized and what cannot be, and
> > > > > >> >> > > > >> needs to be left in hudi-client-spark vs
> > > > > >> >> > > > >> hudi-client-core
> > > > > >> >> > > >
> > > > > >> >> > > > agree
> > > > > >> >> > > >
> > > > > >> >> > > > On Wed, Aug 14, 2019 at 8:35 AM Vinoth Chandar
> > > > > >> >> > > > <[email protected]> wrote:
> > > > > >> >> > > >
> > > > > >> >> > > > > >> We should only stick to Flink Streaming.
> > > > > >> >> > > > > >> Furthermore, if there is a requirement for batch,
> > > > > >> >> > > > > >> then users should use Spark, or then we will anyway
> > > > > >> >> > > > > >> have a Beam integration coming up.
> > > > > >> >> > > > >
> > > > > >> >> > > > > Currently Spark Streaming micro batching fits well
> > > > > >> >> > > > > with Hudi, since it amortizes the cost of indexing,
> > > > > >> >> > > > > workload profiling etc. 1 Spark micro batch = 1 Hudi
> > > > > >> >> > > > > commit.
> > > > > >> >> > > > > With the per-record model in Flink, I am not sure how
> > > > > >> >> > > > > useful it will be to support Hudi.. for e.g., 1 input
> > > > > >> >> > > > > record cannot be 1 Hudi commit, it will be
> > > > > >> >> > > > > inefficient..
> > > > > >> >> > > > >
> > > > > >> >> > > > > On first focussing on decoupling of Spark and Hudi
> > > > > >> >> > > > > alone, yes, a full summary of how Spark is being used,
> > > > > >> >> > > > > in a wiki page, is a good start IMO. We can then hash
> > > > > >> >> > > > > out what can be generalized and what cannot be, and
> > > > > >> >> > > > > needs to be left in hudi-client-spark vs
> > > > > >> >> > > > > hudi-client-core
> > > > > >> >> > > > >
> > > > > >> >> > > > >
> > > > > >> >> > > > >
> > > > > >> >> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang
> > > > > >> >> > > > > <[email protected]> wrote:
> > > > > >> >> > > > >
> > > > > >> >> > > > > > Hi Nick and Taher,
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > I just want to answer Nishith's question. Referencing
> > > > > >> >> > > > > > his old description here:
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > > You can do a parallel investigation while we are
> > > > > >> >> > > > > > > deciding on the module structure. You could be
> > > > > >> >> > > > > > > looking at all the patterns in Hudi's Spark API
> > > > > >> >> > > > > > > usage (RDD/DataSource/SparkContext) and see if such
> > > > > >> >> > > > > > > support can be achieved in theory with Flink. If
> > > > > >> >> > > > > > > not, what is the workaround? Documenting such
> > > > > >> >> > > > > > > patterns would be valuable when multiple engineers
> > > > > >> >> > > > > > > are working on it. For e.g., Hudi relies on (a)
> > > > > >> >> > > > > > > custom partitioning logic for upserts, (b) caching
> > > > > >> >> > > > > > > RDDs to avoid reruns of costly stages, and (c) a
> > > > > >> >> > > > > > > Spark upsert task knowing its Spark
> > > > > >> >> > > > > > > partition/task/attempt ids
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > And just like the title of this thread, we are going
> > > > > >> >> > > > > > to try to decouple Hudi and Spark. That means we can
> > > > > >> >> > > > > > run the whole of Hudi without depending on Spark. So
> > > > > >> >> > > > > > we need to analyze all the usage of Spark in Hudi.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > Here we are not discussing the integration of Hudi
> > > > > >> >> > > > > > and Flink at the application layer. Instead, I want
> > > > > >> >> > > > > > Hudi to be decoupled from Spark and to allow other
> > > > > >> >> > > > > > engines (such as Flink) to replace Spark.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > It can be divided into long-term goals and short-term
> > > > > >> >> > > > > > goals, as Nishith stated in a recent email.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > I mentioned the Flink Batch API here because Hudi can
> > > > > >> >> > > > > > connect with many different sources/sinks. Some
> > > > > >> >> > > > > > file-based reads are not appropriate for Flink
> > > > > >> >> > > > > > Streaming.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > Therefore, this is a comprehensive survey of the use
> > > > > >> >> > > > > > of Spark in Hudi.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > Best,
> > > > > >> >> > > > > > Vino
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > On Tue, Aug 13, 2019 at 5:43 PM taher koitawala
> > > > > >> >> > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > > Hi Vino,
> > > > > >> >> > > > > > > According to what I've seen, Hudi has a lot of
> > > > > >> >> > > > > > > Spark components flowing through it, like
> > > > > >> >> > > > > > > TaskContexts, JavaSparkContexts etc. The main
> > > > > >> >> > > > > > > classes I guess we should focus upon are
> > > > > >> >> > > > > > > HoodieTable and the Hoodie write clients.
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > Also, Vino, I don't think we should be providing a
> > > > > >> >> > > > > > > Flink DataSet implementation. We should only stick
> > > > > >> >> > > > > > > to Flink Streaming. Furthermore, if there is a
> > > > > >> >> > > > > > > requirement for batch, then users should use Spark,
> > > > > >> >> > > > > > > or then we will anyway have a Beam integration
> > > > > >> >> > > > > > > coming up.
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > As for the cache, how about we write our own
> > > > > >> >> > > > > > > stateful Flink function and use RocksDBStateBackend
> > > > > >> >> > > > > > > with some state TTL?
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang
> > > > > >> >> > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > > Hi all,
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > After doing some research, let me share my
> > > > > >> >> > > > > > > > information:
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > - Limitation of computing engine capabilities:
> > > > > >> >> > > > > > > >   Hudi uses Spark's RDD#persist, and Flink
> > > > > >> >> > > > > > > >   currently has no API to cache datasets. Maybe
> > > > > >> >> > > > > > > >   we can only choose to use external storage, or
> > > > > >> >> > > > > > > >   not use a cache? For the use of other APIs, the
> > > > > >> >> > > > > > > >   two currently offer almost equivalent
> > > > > >> >> > > > > > > >   capabilities.
> > > > > >> >> > > > > > > > - The abstraction of the computing engine is
> > > > > >> >> > > > > > > >   different: Considering the different usage
> > > > > >> >> > > > > > > >   scenarios of the computing engine in Hudi, and
> > > > > >> >> > > > > > > >   that Flink has not yet implemented stream/batch
> > > > > >> >> > > > > > > >   unification, we may use both Flink's DataSet
> > > > > >> >> > > > > > > >   API (batch processing) and DataStream API
> > > > > >> >> > > > > > > >   (stream processing).
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Best,
> > > > > >> >> > > > > > > > Vino
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > On Thu, Aug 8, 2019 at 12:57 AM nishith agarwal
> > > > > >> >> > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > > Nick,
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > You bring up a good point about the non-trivial
> > > > > >> >> > > > > > > > > programming model differences between these
> > > > > >> >> > > > > > > > > different technologies. From a theoretical
> > > > > >> >> > > > > > > > > perspective, I'd say considering a higher-level
> > > > > >> >> > > > > > > > > abstraction makes sense. I think we have to
> > > > > >> >> > > > > > > > > decouple some objectives and concerns here.
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > a) The immediate desire is to have Hudi be able
> > > > > >> >> > > > > > > > > to run on a Flink (or non-Spark) engine. This
> > > > > >> >> > > > > > > > > naturally begs the question of decoupling Hudi
> > > > > >> >> > > > > > > > > concepts from direct Spark dependencies.
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > b) If we do want to initiate the above effort,
> > > > > >> >> > > > > > > > > would it make sense to just have a higher-level
> > > > > >> >> > > > > > > > > abstraction, building on other technologies
> > > > > >> >> > > > > > > > > like Beam (Euphoria etc.), and provide single,
> > > > > >> >> > > > > > > > > clean APIs that may be more maintainable from a
> > > > > >> >> > > > > > > > > code perspective? But at the same time, this
> > > > > >> >> > > > > > > > > will introduce challenges in how to maintain
> > > > > >> >> > > > > > > > > efficient and optimized runtime dags for Hudi
> > > > > >> >> > > > > > > > > (since the code would move away from point
> > > > > >> >> > > > > > > > > integrations, and whenever this happens, tuning
> > > > > >> >> > > > > > > > > natively for specific engines becomes more and
> > > > > >> >> > > > > > > > > more difficult).
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > My general opinion is that, as the community
> > > > > >> >> > > > > > > > > grows over time with more folks having an
> > > > > >> >> > > > > > > > > in-depth understanding of Hudi, going from
> > > > > >> >> > > > > > > > > current_state -> (a) -> (b) might be the most
> > > > > >> >> > > > > > > > > reliable and adoptable path for this project.
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > Thanks,
> > > > > >> >> > > > > > > > > Nishith
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng
> > > > > >> >> > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > > There are some non-trivial differences in
> > > > > >> >> > > > > > > > > > programming model and runtime semantics
> > > > > >> >> > > > > > > > > > between Beam, Spark and Flink:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Nishith, Vino - thoughts?
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Does it make sense to consider a higher-level
> > > > > >> >> > > > > > > > > > abstraction / DSL instead of maintaining
> > > > > >> >> > > > > > > > > > different code with the same functionality
> > > > > >> >> > > > > > > > > > but different programming models?
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> https://beam.apache.org/documentation/sdks/java/euphoria/
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Nick
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal
> > > > > >> >> > > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > +1 for Approach 1, point integration with
> > > > > >> >> > > > > > > > > > each framework.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Pros for point integration:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > - The Hudi community is already familiar with
> > > > > >> >> > > > > > > > > >   Spark and Spark-based actions/shuffles etc.
> > > > > >> >> > > > > > > > > >   Since both modules can be decoupled, this
> > > > > >> >> > > > > > > > > >   enables us to have a steady release of Hudi
> > > > > >> >> > > > > > > > > >   for 1 execution engine (Spark) while we
> > > > > >> >> > > > > > > > > >   hone our skills and iterate on making the
> > > > > >> >> > > > > > > > > >   Flink dag optimized and performant with the
> > > > > >> >> > > > > > > > > >   right configuration.
> > > > > >> >> > > > > > > > > > - This might be a stepping stone towards
> > > > > >> >> > > > > > > > > >   rewriting the entire code base to be
> > > > > >> >> > > > > > > > > >   agnostic of Spark/Flink. This approach will
> > > > > >> >> > > > > > > > > >   help us fix tests and intricacies, and help
> > > > > >> >> > > > > > > > > >   make the code base ready for a larger
> > > > > >> >> > > > > > > > > >   rework.
> > > > > >> >> > > > > > > > > > - Seems like the easiest way to add Flink
> > > > > >> >> > > > > > > > > >   support.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Cons:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > - More code paths to maintain and reason
> > > > > >> >> > > > > > > > > >   about, since the Spark and Flink
> > > > > >> >> > > > > > > > > >   integrations will naturally diverge over
> > > > > >> >> > > > > > > > > >   time.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Theoretically, I do like the idea of being
> > > > > >> >> > > > > > > > > > able to run the Hudi dag on Beam more than
> > > > > >> >> > > > > > > > > > point integrations, where there is one
> > > > > >> >> > > > > > > > > > API/logic to reason about. But practically,
> > > > > >> >> > > > > > > > > > that may not be the right direction.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Pros:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > - Lesser cognitive burden in maintaining,
> > > > > >> >> > > > > > > > > >   evolving and releasing the project, with
> > > > > >> >> > > > > > > > > >   one API to reason with.
> > > > > >> >> > > > > > > > > > - Theoretically, going forward, assuming Beam
> > > > > >> >> > > > > > > > > >   is adopted as a standard programming
> > > > > >> >> > > > > > > > > >   paradigm for stream/batch, this would
> > > > > >> >> > > > > > > > > >   enable consumers to leverage the power of
> > > > > >> >> > > > > > > > > >   Hudi more easily.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Cons:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > - Massive rewrite of the code base.
> > > > > >> >> > > > > > > > > >   Additionally, since we would have moved
> > > > > >> >> > > > > > > > > >   away from directly using Spark APIs, there
> > > > > >> >> > > > > > > > > >   is a bigger risk of regression. We would
> > > > > >> >> > > > > > > > > >   have to be very thorough with all the
> > > > > >> >> > > > > > > > > >   intricacies and ensure the same stability
> > > > > >> >> > > > > > > > > >   of new releases.
> > > > > >> >> > > > > > > > > > - Managing future features (which may be very
> > > > > >> >> > > > > > > > > >   Spark driven) will either clash, or pause,
> > > > > >> >> > > > > > > > > >   or will need to be reworked.
> > > > > >> >> > > > > > > > > > - Tuning jobs for Spark/Flink-type execution
> > > > > >> >> > > > > > > > > >   frameworks individually might be difficult,
> > > > > >> >> > > > > > > > > >   and will get more difficult over time as
> > > > > >> >> > > > > > > > > >   the project evolves, where some Beam
> > > > > >> >> > > > > > > > > >   integrations with Spark/Flink may not work
> > > > > >> >> > > > > > > > > >   as expected.
> > > > > >> >> > > > > > > > > > - Also, as pointed out above, we'd probably
> > > > > >> >> > > > > > > > > >   need to support the hoodie-spark module as
> > > > > >> >> > > > > > > > > >   first-class.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Thanks,
> > > > > >> >> > > > > > > > > > Nishith
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher
> > > > > >> >> > > > > > > > > > koitawala <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Hi Vinoth,
> > > > > >> >> > > > > > > > > > Are there some tasks I can take up to ramp up
> > > > > >> >> > > > > > > > > > on the code? I want to get more used to the
> > > > > >> >> > > > > > > > > > code and understand the existing
> > > > > >> >> > > > > > > > > > implementation better.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Thanks,
> > > > > >> >> > > > > > > > > > Taher Koitawala
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar
> > > > > >> >> > > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Let's see if others have any thoughts as
> > > > > >> >> > > > > > > > > > well. We can plan to fix the approach by EOW.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang
> > > > > >> >> > > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Hi guys,
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Also, +1 for Approach 1 like Taher.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of
> > > > > >> >> > > > > > > > > > this model and come up with means to
> > > > > >> >> > > > > > > > > > refactor this cleanly, this would be
> > > > > >> >> > > > > > > > > > promising.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Yes, when we reach a conclusion, we could
> > > > > >> >> > > > > > > > > > start this work.
> > > > > >> >> work.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Best,
> > > > > >> >> > > > > > > > > > Vino
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 12:28 AM taher
> > > > > >> >> > > > > > > > > > koitawala <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > +1 for Approach 1, point integration with
> > > > > >> >> > > > > > > > > > each framework.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Approach 2 has a problem, as you said:
> > > > > >> >> > > > > > > > > > "Developers need to think about
> > > > > >> >> > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > > >> >> > > > > > > > > > So in the end, this may not be the panacea
> > > > > >> >> > > > > > > > > > that it seems to be"
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > We have seen various pipelines in the Beam
> > > > > >> >> > > > > > > > > > dag being expressed differently than we had
> > > > > >> >> > > > > > > > > > them in our original use case. And also,
> > > > > >> >> > > > > > > > > > switching between Spark and Flink runners in
> > > > > >> >> > > > > > > > > > Beam has various impacts on the pipelines,
> > > > > >> >> > > > > > > > > > like some features available in Flink are
> > > > > >> >> > > > > > > > > > not available on the Spark runner etc.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Refer to this compatibility matrix ->
> > > > > >> >> > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Hence my vote is on Approach 1: let's
> > > > > >> >> > > > > > > > > > decouple and build the abstraction for each
> > > > > >> >> > > > > > > > > > framework. That is a much better option. We
> > > > > >> >> > > > > > > > > > will also have more control over each
> > > > > >> >> > > > > > > > > > framework's implementation.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar
> > > > > >> >> > > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Would like to highlight that there are two
> > > > > >> >> > > > > > > > > > distinct approaches here with different
> > > > > >> >> > > > > > > > > > tradeoffs. Think of this as my braindump, as
> > > > > >> >> > > > > > > > > > I have been thinking about this quite a bit
> > > > > >> >> > > > > > > > > > in the past.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > *Approach 1: Point integration with each
> > > > > >> >> > > > > > > > > > framework*
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > We may need a pure client module named, for
> > > > > >> >> > > > > > > > > > example, hoodie-client-core (common).
> > > > > >> >> > > > > > > > > > Then we could have: hoodie-client-spark,
> > > > > >> >> > > > > > > > > > hoodie-client-flink and hoodie-client-beam.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > (+) This is the safest to do IMO, since we
> > > > > >> >> > > > > > > > > > can isolate the current Spark execution
> > > > > >> >> > > > > > > > > > (hoodie-spark, hoodie-client-spark) from the
> > > > > >> >> > > > > > > > > > changes for Flink, while it stabilizes over a
> > > > > >> >> > > > > > > > > > few releases.
> > > > > >> >> > > > > > > > > > (-) Downside is that the utilities need to
> > > > > >> >> > > > > > > > > > be redone: hoodie-utilities-spark and
> > > > > >> >> > > > > > > > > > hoodie-utilities-flink and
> > > > > >> >> > > > > > > > > > hoodie-utilities-core? hoodie-cli?
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of
> > > > > >> >> > > > > > > > > > this model and come up with means to
> > > > > >> >> > > > > > > > > > refactor this cleanly, this would be
> > > > > >> >> > > > > > > > > > promising.
> > > > promising.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > *Approach 2: Beam as the compute abstraction*
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Another more drastic approach is to remove
> > > > > >> >> > > > > > > > > > Spark as the compute abstraction for writing
> > > > > >> >> > > > > > > > > > data and replace it with Beam.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > (+) All of the code remains more or less
> > > > > >> >> > > > > > > > > > similar, and there is one compute API to
> > > > > >> >> > > > > > > > > > reason about.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > (-) The (very big) assumption here is that
> > > > > >> >> > > > > > > > > > we are able to tune the Spark runtime the
> > > > > >> >> > > > > > > > > > same way using Beam: custom partitioners,
> > > > > >> >> > > > > > > > > > support for all RDD operations we invoke,
> > > > > >> >> > > > > > > > > > caching etc. etc.
> > > > > >> >> > > > > > > > > > (-) It will be a massive rewrite, and
> > > > > >> >> > > > > > > > > > testing of such a large rewrite would also
> > > > > >> >> > > > > > > > > > be really challenging, since we need to pay
> > > > > >> >> > > > > > > > > > attention to all the intricate details to
> > > > > >> >> > > > > > > > > > ensure the Spark users today experience no
> > > > > >> >> > > > > > > > > > regressions/side-effects.
> > > > > >> >> > > > > > > > > > (-) Note that we still need to probably
> > > > > >> >> > > > > > > > > > support the hoodie-spark module, and maybe a
> > > > > >> >> > > > > > > > > > first-class such integration with Flink, for
> > > > > >> >> > > > > > > > > > native Flink/Spark pipeline authoring. Users
> > > > > >> >> > > > > > > > > > of, say, DeltaStreamer need to pass in Spark
> > > > > >> >> > > > > > > > > > or Flink configs anyway.. Developers need to
> > > > > >> >> > > > > > > > > > think about
> > > > > >> >> > > > > > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > > > >> >> > > > > > > > > > So in the end, this may not be the panacea
> > > > > >> >> > > > > > > > > > that it seems to be.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >
> > > > > >> >> > > > > > > > > > >
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > One goal for the HIP is to get us all to
> > > > > >> >> > > > > > > > > > agree as a community on which one to pick,
> > > > > >> >> > > > > > > > > > with sufficient investigation, testing and
> > > > > >> >> > > > > > > > > > benchmarking..
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang
> > > > > >> >> > > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > +1 for both Beam and Flink
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > First step here is to probably draw out the
> > > > > >> >> > > > > > > > > > current hierarchy and figure out what the
> > > > > >> >> > > > > > > > > > abstraction points are..
> > > > > >> >> > > > > > > > > > In my opinion, the runtime (Spark, Flink)
> > > > > >> >> > > > > > > > > > should be done at the hoodie-client level
> > > > > >> >> > > > > > > > > > and just used by hoodie-utilities
> > > > > >> >> > > > > > > > > > seamlessly..
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > +1 for Vinoth's opinion; it should be the
> > > > > >> >> > > > > > > > > > first step.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > No matter which computing framework we hope
> > > > > >> >> > > > > > > > > > Hudi will integrate with, we need to decouple
> > > > > >> >> > > > > > > > > > the Hudi client and Spark.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > We may need a pure client module named, for
> > > > > >> >> > > > > > > > > > example, hoodie-client-core (common).
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Then we could have: hoodie-client-spark,
> > > > > >> >> > > > > > > > > > hoodie-client-flink and hoodie-client-beam.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Sun, Aug 4, 2019 at 10:45 AM Suneel
> > > > > >> >> > > > > > > > > > Marthi <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > +1 for Beam -- agree with Semantic Beeng's
> > > > > analysis.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher
> > > > > >> >> > > > > > > > > > koitawala <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > So the way to go around this is to file a
> > > > > >> >> > > > > > > > > > HIP: chalk out all the classes and start
> > > > > >> >> > > > > > > > > > moving towards a pure client.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > Secondly, should we want to try Beam?
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > I think there is too much going on here and
> > > > > >> >> > > > > > > > > > I'm not able to follow. If we want to try out
> > > > > >> >> > > > > > > > > > Beam all along, I don't think it makes sense
> > > > > >> >> > > > > > > > > > to do anything on Flink then.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng
> > > > > >> >> > > > > > > > > > <[email protected]> wrote:
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >> +1 My money is on this approach.
> > > > > >> >> > > > > > > > > > >>
> > > > > >> >> > > > > > > > > > >> The existing abstractions from Beam seem
> > > > > >> >> > > > > > > > > > >> enough for the use cases as I imagine
> > > > > >> >> > > > > > > > > > >> them.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >> Flink also has "dynamic table", "table
> > > source"
> > > > > and
> > > > > >> >> > "table
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > sink"
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > which
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > seem very useful abstractions where Hudi
> > might
> > > > fit
> > > > > >> >> nicely.
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >>
> > > > > >> >> > > > > > > > > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > > >> >> > > > > > > > > >
> > > > > >> >> > > > > > > > > > >>
> > > > > >> >> > > > > > > > > > >> Attached a screen shot.
> > > > > >> >> > > > > > > > > > >>
> > > > > >> >> > > > > > > > > > >> This seems to fit with the original
> > > > > >> >> > > > > > > > > > >> premise of Hudi as well.
> > > > > >> >> > > > > > > > > > >>
> > > > > >> >> >