Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-10-02 Thread Vinoth Chandar
Based on some conversations I had with Flink folks, including Hudi's very
own mentor Thomas, it seems future-proof to look into supporting the Flink
streaming APIs. The batch APIs, IIUC, will move towards converging with the
streaming APIs, which matches Hudi's model anyway.

From Hudi's perspective, the following are the major pieces of work we need to do to
pave the way (it took some time to distill this down).
Most of the issues with supporting Flink come from our extensive use of Spark
caching in two places (a toy sketch of the pattern follows below):

a) HoodieBloomIndex: To achieve performance and scale, we cache the input
RDD for indexing. We make two passes over the input, and without the caching
Spark would recompute the RDD, which of course won't scale.
b) WorkloadProfile: After indexing, we compute the number of inserts and
updates so that we can size files and lay out data in the right order, etc.
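
To make the caching dependency concrete, here is a toy Spark (Java) sketch of the
two-passes-over-a-cached-input pattern; this is NOT the actual HoodieBloomIndex /
WorkloadProfile code, and all class and variable names below are made up for
illustration:

import java.util.Arrays;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class TwoPassCachingSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "two-pass-sketch");

    // Pretend input: (recordKey, partitionPath) pairs, analogous to incoming upserts.
    JavaPairRDD<String, String> input = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("key-1", "2019/10/01"),
        new Tuple2<>("key-2", "2019/10/01"),
        new Tuple2<>("key-3", "2019/10/02")));

    // Without this persist, each action below recomputes the full lineage.
    input.persist(StorageLevel.MEMORY_AND_DISK_SER());

    // Pass 1 (WorkloadProfile-like): how many records land in each partition path.
    Map<String, Long> recordsPerPartition =
        input.mapToPair(t -> new Tuple2<>(t._2(), t._1())).countByKey();
    System.out.println("records per partition: " + recordsPerPartition);

    // Pass 2 (index-tagging-like): join keys against a tiny "existing files" index.
    JavaPairRDD<String, String> keyToFileGroup = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("key-1", "file-group-7")));
    long updates = input.join(keyToFileGroup).count();
    System.out.println("updates: " + updates + ", inserts: " + (input.count() - updates));

    input.unpersist();
    jsc.stop();
  }
}

Any Flink equivalent would have to provide the same "reuse the tagged input across
two passes" guarantee through state or the operator DAG rather than an RDD cache.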

We need projects to

a) Build a new index, based on something like the HashMap Index proposal here:
https://github.com/apache/incubator-hudi/wiki/HashMap-Index (a toy sketch of the
idea follows after this list). I have more ideas that I plan to dump into a HIP.
But if someone can drive this, happy to partner.
b) We need to look into an efficient way of knowing the number of inserts and
updates (and which file groups they touch) across partitions. I have only early
ideas and need some help here as well to arrive at a solution.
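
For (a), just to make the discussion concrete, here is a toy sketch of what a
hash-style mapping from record key to file group could look like. This is NOT the
proposal on the wiki page above, just an illustration; the bucket and file-group
naming is made up:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashBucketIndexSketch {

  private final int numBuckets;

  public HashBucketIndexSketch(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  // Returns a stable bucket id for a record key; no lookup table to cache.
  public int bucketFor(String recordKey) {
    CRC32 crc = new CRC32();
    crc.update(recordKey.getBytes(StandardCharsets.UTF_8));
    return (int) (crc.getValue() % numBuckets);
  }

  // A file-group id could then be derived purely from (partitionPath, bucket).
  public String fileGroupFor(String partitionPath, String recordKey) {
    return partitionPath + "/bucket-" + bucketFor(recordKey);
  }

  public static void main(String[] args) {
    HashBucketIndexSketch index = new HashBucketIndexSketch(16);
    System.out.println(index.fileGroupFor("2019/10/02", "key-123"));
  }
}

The appeal of something in this direction is that tagging becomes a pure function
of the key, so no engine-side caching or external store is needed for lookups.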

Both a & b need some groundwork/cleanup. IIUC, Balaji is already working
on some of it.

If we can have volunteers for each of these areas, we can get underway.



On Thu, Sep 26, 2019 at 10:13 AM Vinoth Chandar  wrote:

> I'd kind of expect it's not as fast today. But from the keynote at Flink
> Forward it seemed like it's about to get a lot better, and we can definitely
> help by bringing Hudi goodness to Flink users.
>
> Others, what do you think? I think this is a good discussion to have
> upfront. For e.g., the needs of streaming performance may require a very
> different design..
>
> On Thu, Sep 26, 2019 at 9:56 AM Taher Koitawala 
> wrote:
>
>> Hi Vinoth,
>>   IMHO we should stick to Spark for micro-batching, for 2 reasons. 1:
>> Ease of use. 2: Performance. Flink batch is not as fast as Spark. Also, the
>> rich library of functions and the ease of integration that Spark has with
>> Hive etc. is not there in Flink batch.
>>
>>
>> Regards,
>> Taher Koitawala
>>
>>
>>
>> On Thu, Sep 26, 2019, 10:11 PM Vinoth Chandar  wrote:
>>
>> > How mature is Flink batch? I am wondering if we can with Flink Batch
>> APIs.
>> > Hudi itself will provide micro-batching semantics (incremental pull,
>> > upserts)?
>> > For true streaming performace, I am not sure even Cloud stores are
>> ready,
>> > since none of them support append()s.
>> > IMHO a mini/micro batch model makes sense for the kind of space Hudi
>> > solves.. We can keep working towards making this
>> >
>> > I have some thoughts on indexing, a new way to index, that complements
>> > BloomIndex and overcomes some of the spark caching needs.
>> > Will write it down over the weekend and share it..
>> >
>> > P.S: I am also wondering if we should generalize HIPs into a RFC
>> > process[1], that way it does not have to be a real technical proposal
>> and
>> > we can still have structured conversation on future roadmap etc. and
>> evolve
>> > the shape together..
>> > (May be this deserves its own separate DISCUSS thread)
>> >
>> >
>> > [1] - https://en.wikipedia.org/wiki/Request_for_Comments
>> >
>> >
>> > On Wed, Sep 25, 2019 at 9:07 PM Taher Koitawala 
>> > wrote:
>> >
>> > > Hi Vino,
>> > >  Agree with your suggestion. We all know when thought Flink is
>> > > streaming we can control how files get rolled out through
>> checkpointing
>> > > configurations. Bad config and small files get rolled out. Good config
>> > and
>> > > files are properly sized.
>> > >
>> > >  Also I understand the concern of reading files with Flink and
>> > > performance related to it. (I have faced it before). So how about we
>> > built
>> > > our own functions which can read and write efficiently and is not a
>> > source
>> > > function but an operator! So what I mean is let's use the Akka based
>> > > semantics of passing messages between these operators and read what is
>> > > required.
>> > >
>> > > Have 1 custom stream operator who takes requests for reading files
>> (Again
>> > > that is not a source function, it is an operator), that operator reads
>> > the
>> > > file and passes it downstream in a parallel manner. (May be AsyncIO
>> > > extension can be a better call here). Let me know your thoughts on
>> this.
>> > > I.e: if we choose everything will be written on core DataStreams API
>> > >
>> > > As per the Flink batch and Stream talk goes. I guess as a community we
>> > have
>> > > already agreed that for batch the spark engine is good enough and
>> streams
>> > > will be power by Flink where as beam will do both. By the time Flink
>> > > unification of batch and stream is completely our code will be batch
>> > > compatible with minimal changes.
>> > >
>> > > Further we need to plan what part of Flink API 

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-26 Thread Taher Koitawala
Hi Vinoth,
  IMHO we should stick to Spark for micro-batching, for 2 reasons. 1:
Ease of use. 2: Performance. Flink batch is not as fast as Spark. Also, the
rich library of functions and the ease of integration that Spark has with
Hive etc. is not there in Flink batch.


Regards,
Taher Koitawala



On Thu, Sep 26, 2019, 10:11 PM Vinoth Chandar  wrote:

> How mature is Flink batch? I am wondering if we can go with Flink Batch APIs.
> Hudi itself will provide micro-batching semantics (incremental pull,
> upserts)?
> For true streaming performance, I am not sure even Cloud stores are ready,
> since none of them support append()s.
> IMHO a mini/micro-batch model makes sense for the kind of space Hudi
> solves.. We can keep working towards making this
>
> I have some thoughts on indexing, a new way to index, that complements
> BloomIndex and overcomes some of the spark caching needs.
> Will write it down over the weekend and share it..
>
> P.S.: I am also wondering if we should generalize HIPs into an RFC
> process[1]; that way it does not have to be a real technical proposal and
> we can still have structured conversations on the future roadmap etc. and evolve
> the shape together..
> (Maybe this deserves its own separate DISCUSS thread)
>
>
> [1] - https://en.wikipedia.org/wiki/Request_for_Comments
>
>
> On Wed, Sep 25, 2019 at 9:07 PM Taher Koitawala 
> wrote:
>
> > Hi Vino,
> >  Agree with your suggestion. We all know when thought Flink is
> > streaming we can control how files get rolled out through checkpointing
> > configurations. Bad config and small files get rolled out. Good config
> and
> > files are properly sized.
> >
> >  Also I understand the concern of reading files with Flink and
> > performance related to it. (I have faced it before). So how about we
> built
> > our own functions which can read and write efficiently and is not a
> source
> > function but an operator! So what I mean is let's use the Akka based
> > semantics of passing messages between these operators and read what is
> > required.
> >
> > Have 1 custom stream operator who takes requests for reading files (Again
> > that is not a source function, it is an operator), that operator reads
> the
> > file and passes it downstream in a parallel manner. (May be AsyncIO
> > extension can be a better call here). Let me know your thoughts on this.
> > I.e: if we choose everything will be written on core DataStreams API
> >
> > As per the Flink batch and Stream talk goes. I guess as a community we
> have
> > already agreed that for batch the spark engine is good enough and streams
> > will be power by Flink where as beam will do both. By the time Flink
> > unification of batch and stream is completely our code will be batch
> > compatible with minimal changes.
> >
> > Further we need to plan what part of Flink API will we use, should we
> stick
> > to hard core DataStreams API or should we merge it with Table APIs(I
> think
> > we should) so that we get Append sink, retract sink, temporial tables,
> the
> > capability to use the new blink table planner, also it would to some
> extent
> > lift our work of reading files since the table ApI I believe is good at
> > doing those things and comes with in build readers and writers. So
> > basically if we use that we could also do instream join the stream source
> > and the Hudi files to recompute upserts etc. AFAIK Flink Table ApI is
> also
> > giving a .cache() like functionality now. (Saw conversations about it in
> > the mailing list)
> >
> > So I think we really need to start planning such things to move ahead.
> > Other than that 100% with Vino that we cannot read files otherwise in
> Flink
> > it would be really bad to do that.
> >
> > Regards,
> > Taher Koitawala
> >
> > On Thu, Sep 26, 2019, 8:47 AM vino yang  wrote:
> >
> > > Hi
> > >
> > > A simple example. In Hudi Project, you can find many code snippet like
> > > `spark.read().format().load()` The load method can pass any path,
> > > especially DFS paths.
> > >
> > > While if we only want to use Flink streaming, there is not a good way
> to
> > > read HDFS now.
> > >
> > > In addition, we.also need to consider other ability between Flink and
> > > Spark. You should know Spark API(non-structured streaming mode) can
> > support
> > > both Streaming(micro-batch) and batch. However, Flink distinguishs them
> > > with two differentAPI, they have different feature set.
> > >
> > >
> > >
> > > On 09/25/2019 13:15, Semantic Beeng  wrote:
> > >
> > > Hi Vino,
> > >
> > > Would you be kind to start a wiki page to discuss this deep
> understanding
> > > of the functionality and design of Hudi?
> > >
> > > There you can put git links (https://github.com/ben-gibson/GitLink for
> > > intellij) and design knowledge so we can discuss in context.
> > >
> > > I am exploring the approach from this retweet
> > > https://twitter.com/semanticbeeng/status/117624125096789?s=20 and
> > > need this understanding you have.
> > >
> > > "difficult 

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-25 Thread Taher Koitawala
Hi Vino,
 Agree with your suggestion. We all know that even though Flink is
streaming, we can control how files get rolled out through checkpointing
configurations. With a bad config, small files get rolled out; with a good
config, files are properly sized.

 Also, I understand the concern of reading files with Flink and the
performance related to it (I have faced it before). So how about we build
our own functions that can read and write efficiently and are not a source
function but an operator! What I mean is, let's use the Akka-based
semantics of passing messages between these operators and read what is
required.

Have one custom stream operator that takes requests for reading files (again,
that is not a source function, it is an operator); that operator reads the
file and passes it downstream in a parallel manner. (Maybe the AsyncIO
extension is a better call here; a rough sketch follows below.) Let me know your
thoughts on this. I.e., if we choose this, everything will be written on the core
DataStream API.
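
Something like this rough sketch is what I have in mind; purely illustrative, the
class and function names are made up, nothing here is tied to Hudi's code, and
"/etc/hosts" is just a stand-in path so the sketch runs anywhere:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncFileReadSketch {

  // Reads each requested file off the main task thread and emits its lines downstream.
  static class ReadFileAsync extends RichAsyncFunction<String, List<String>> {
    @Override
    public void asyncInvoke(String path, ResultFuture<List<String>> resultFuture) {
      CompletableFuture
          .supplyAsync(() -> {
            try {
              return Files.readAllLines(Paths.get(path));
            } catch (Exception e) {
              throw new RuntimeException(e);
            }
          })
          .thenAccept(lines -> resultFuture.complete(Collections.singleton(lines)));
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Upstream operators decide *which* files need reading and pass requests along.
    DataStream<String> fileRequests = env.fromElements("/etc/hosts");

    DataStream<List<String>> fileContents = AsyncDataStream.unorderedWait(
        fileRequests, new ReadFileAsync(), 30, TimeUnit.SECONDS, 10);

    fileContents.print();
    env.execute("async-file-read-sketch");
  }
}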

As far as the Flink batch and stream talk goes, I guess as a community we have
already agreed that for batch the Spark engine is good enough and streams
will be powered by Flink, whereas Beam will do both. By the time Flink's
unification of batch and stream is complete, our code will be batch
compatible with minimal changes.

Further, we need to plan what part of the Flink API we will use: should we stick
to the hard-core DataStream API, or should we merge it with the Table API (I think
we should) so that we get the append sink, retract sink, temporal tables, and the
capability to use the new Blink table planner? That would also, to some extent,
lift our work of reading files, since the Table API, I believe, is good at
doing those things and comes with in-built readers and writers. So basically,
if we use that, we could also do in-stream joins between the stream source
and the Hudi files to recompute upserts etc. AFAIK the Flink Table API is also
getting a .cache()-like functionality now. (Saw conversations about it on
the mailing list.) A minimal sketch of the Table API side follows below.
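
For instance, a bare-bones Table API + Blink planner setup (Flink 1.9-era API,
purely illustrative, with a made-up change stream and no Hudi wiring) would look
something like:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class BlinkTableSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Flink 1.9-era settings: Table API in streaming mode on the Blink planner.
    EnvironmentSettings settings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inStreamingMode()
        .build();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

    // A made-up incoming change stream of (record_key, partition_path) pairs.
    DataStream<Tuple2<String, String>> changes = env.fromElements(
        Tuple2.of("key-1", "2019/10/01"),
        Tuple2.of("key-2", "2019/10/02"),
        Tuple2.of("key-3", "2019/10/01"));
    tableEnv.registerTable("incoming_changes",
        tableEnv.fromDataStream(changes, "record_key, partition_path"));

    // SQL over the stream; a GROUP BY produces a retracting (updating) result.
    Table counts = tableEnv.sqlQuery(
        "SELECT partition_path, COUNT(*) AS record_cnt "
            + "FROM incoming_changes GROUP BY partition_path");

    tableEnv.toRetractStream(counts, Row.class).print();
    env.execute("blink-table-sketch");
  }
}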

So I think we really need to start planning such things to move ahead.
Other than that, I'm 100% with Vino that we cannot read files any other way in
Flink; it would be really bad to do that.

Regards,
Taher Koitawala

On Thu, Sep 26, 2019, 8:47 AM vino yang  wrote:

> Hi
>
> A simple example. In the Hudi project, you can find many code snippets like
> `spark.read().format().load()`. The load method can take any path,
> especially DFS paths.
>
> Whereas if we only want to use Flink streaming, there is not a good way to
> read HDFS now.
>
> In addition, we also need to consider other capability differences between
> Flink and Spark. You should know the Spark API (non-Structured-Streaming mode)
> can support both streaming (micro-batch) and batch. However, Flink distinguishes
> them with two different APIs; they have different feature sets.
>
>
>
> On 09/25/2019 13:15, Semantic Beeng  wrote:
>
> Hi Vino,
>
> Would you be kind to start a wiki page to discuss this deep understanding
> of the functionality and design of Hudi?
>
> There you can put git links (https://github.com/ben-gibson/GitLink for
> intellij) and design knowledge so we can discuss in context.
>
> I am exploring the approach from this retweet
> https://twitter.com/semanticbeeng/status/117624125096789?s=20 and
> need this understanding you have.
>
> "difficult to ignore Flink Batch API to match some features provide by
> Hudi now" - can you please post there some gitlinks to this?
>
> Thanks
>
> Nick
>
>
>
>
>
>
> On September 24, 2019 at 10:22 PM vino yang 
> wrote:
>
> Hi Taher,
>
> As I mentioned in the previous mail. Things may not be too easy by just
> using Flink state API.
>
> Copied here " Hudi can connect with many different Source/Sinks. Some
> file-based reads are not appropriate for Flink Streaming."
>
> Although, unify Batch and Streaming is Flink's goal. But, it is difficult
> to ignore Flink Batch API to match some features provide by Hudi now.
>
> The example you provided is in application layer about usage. So my
> suggestion is be patient, it needs time to give an detailed design.
>
> Best,
> Vino
>
>
>
> On 09/24/2019 22:38, Taher Koitawala  wrote:
> Hi All,
>  Sample code to see how records tagging will be handled in
> Flink is posted on [1]. The main class to run the same is MockHudi.java
> with a sample path for checkpointing.
>
> As of now this is just a sample to know we should ke caching in Flink
> states with bare minimum configs.
>
>
> As per my experience I have cached around 10s of TBs in Flink rocksDB
> state
> with the right configs. So I'm sure it should work here as well.
>
> 1:
>
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
>
> Regards,
> Taher Koitawala
>
>
> On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar  wrote:
>
> > It wont be much different than the HBaseIndex we have today. Would like
> to
> > have always have an option like BloomIndex that does not need any
> external
> > dependencies.
> > The moment you bring an external data 

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-25 Thread vino yang
Hi,

A simple example: in the Hudi project, you can find many code snippets like
`spark.read().format().load()`. The load method can take any path, especially
DFS paths. Whereas if we only want to use Flink streaming, there is not a good
way to read HDFS now.

In addition, we also need to consider other capability differences between Flink
and Spark. You should know the Spark API (non-Structured-Streaming mode) can
support both streaming (micro-batch) and batch. However, Flink distinguishes them
with two different APIs, and they have different feature sets.

On 09/25/2019 13:15, Semantic Beeng wrote: Hi Vino, Would you be kind to start a
wiki page to discuss this deep understanding of the functionality and design of 
Hudi? There you can put git links (https://github.com/ben-gibson/GitLink for 
intellij) and design knowledge so we can discuss in context. I am exploring the 
approach from this retweet 
https://twitter.com/semanticbeeng/status/117624125096789?s=20 and need this 
understanding you have. "difficult to ignore Flink Batch API to match some 
features provide by Hudi now" - can you please post there some gitlinks to 
this? Thanks Nick On September 24, 2019 at 10:22 PM vino yang 
 wrote: Hi Taher, As I mentioned in the previous mail. 
Things may not be too easy by just using Flink state API. Copied here " Hudi 
can connect with many different Source/Sinks. Some file-based reads are not 
appropriate for Flink Streaming." Although, unify Batch and Streaming is 
Flink's goal. But, it is difficult to ignore Flink Batch API to match some 
features provide by Hudi now. The example you provided is in application layer 
about usage. So my suggestion is be patient, it needs time to give an detailed 
design. Best, Vino On 09/24/2019 22:38, Taher Koitawala wrote:Hi All,           
   Sample code to see how records tagging will be handled in Flink is posted on 
[1]. The main class to run the same is MockHudi.java with a sample path for 
checkpointing. As of now this is just a sample to know we should ke caching in 
Flink states with bare minimum configs. As per my experience I have cached 
around 10s of TBs in Flink rocksDB state with the right configs. So I'm sure it 
should work here as well. 1: 
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
 Regards, Taher Koitawala On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar 
 wrote: > It wont be much different than the HBaseIndex we 
have today. Would like to > have always have an option like BloomIndex that 
does not need any external > dependencies. > The moment you bring an external 
data store in, someone becomes a DBA. :) > > On Sun, Sep 22, 2019 at 6:46 AM 
Semantic Beeng  > wrote: > > > @vc can you see how 
ApacheCrail could be used to implement this at scale > > but also in a way that 
abstracts over both Spark and Flink? > > > > "Crail Store implements a 
hierarchical namespace across a cluster of RDMA > > interconnected storage 
resources such as DRAM or flash" > > > > 
https://crail.incubator.apache.org/overview/ > > > > + 2 cents > > 
https://twitter.com/semanticbeeng/status/1175767500790915072?s=20 > > > > 
Cheers > > > > Nick > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar 
 > wrote: > > > > > > It could be much larger. :) imagine 
billions of keys each 32 bytes, > mapped > > to another 32 byte > > > > The 
advantage of the current bloom index is that its effectively stored > > with 
data itself and this reduces complexity in terms of keeping index > and > > 
data consistent etc > > > > One orthogonal idea from long time ago that moves 
indexing out of data > > storage and is generalizable > > > > 
https://github.com/apache/incubator-hudi/wiki/HashMap-Index > > > > If someone 
here knows flink well and can implement some standalone flink > > code to mimic 
tagLocation() functionality and share with the group, that > > would be great. 
Lets worry about performance once we have a flink DAG. I > > think this is a 
critical and most tricky piece in supporting flink. > > > > On Sat, Sep 21, 
2019 at 4:17 AM Vinay Patil  > > wrote: > > > > Hi 
Taher, > > > > I agree with this , if the state is becoming too large we should 
have an > > option of storing it in external state like File System or RocksDb. 
> > > > @Vinoth Chandar  can the state of HoodieBloomIndex 
go > > beyond 10-15 GB > > > > Regards, > > Vinay Patil > > > > > > > > > On 
Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala  > > wrote: > 
> > > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex > 
> with > > >> HeapState, RocksDBState and FsState but on Spark. > > >> > > >> 
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala  > > >> 
wrote: > > >> > > >> > Hi Vinoth, > > >> > Having seen the doc and code. I 
understand the > > >> > HoodieBloomIndex mainly caches key and partition path. 
Can we > address > > >> how > > >> > Flink does it? Like, have HeapState where 
the user chooses to cache > > the > > >> > Index on heap, RockDBState where 
indexes are 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi Vino,
  This is not a design for Hudi on Flink. This was simply a mock up of
tagLocations() spark cache to Flink state as Vinoth wanted to see.

As per the Flink batch and Streaming I am well aware of the batch and
Stream unification efforts of Flink. However I think that is still on
progress and for now to batch on Hudi let's stick to Spark, however only
for streaming let's use Flink streaming.



Regards,
Taher Koitawala

On Wed, Sep 25, 2019, 7:52 AM vino yang  wrote:

> Hi Taher,
>
> As I mentioned in the previous mail, things may not be too easy by just
> using the Flink state API.
>
> Copied here: "Hudi can connect with many different Source/Sinks. Some
> file-based reads are not appropriate for Flink Streaming."
>
> Although unifying Batch and Streaming is Flink's goal, it is difficult
> to ignore the Flink Batch API to match some features provided by Hudi now.
>
> The example you provided is at the application layer, about usage. So my
> suggestion is to be patient; it needs time to produce a detailed design.
>
> Best,
> Vino
>
>
>
> On 09/24/2019 22:38, Taher Koitawala  wrote:
> Hi All,
>  Sample code to see how records tagging will be handled in
> Flink is posted on [1]. The main class to run the same is MockHudi.java
> with a sample path for checkpointing.
>
> As of now this is just a sample to know we should ke caching in Flink
> states with bare minimum configs.
>
>
> As per my experience I have cached around 10s of TBs in Flink rocksDB
> state
> with the right configs. So I'm sure it should work here as well.
>
> 1:
>
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
>
> Regards,
> Taher Koitawala
>
>
> On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar  wrote:
>
> > It wont be much different than the HBaseIndex we have today. Would like
> to
> > have always have an option like BloomIndex that does not need any
> external
> > dependencies.
> > The moment you bring an external data store in, someone becomes a DBA.
> :)
> >
> > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng 
> > wrote:
> >
> > > @vc can you see how ApacheCrail could be used to implement this at
> scale
> > > but also in a way that abstracts over both Spark and Flink?
> > >
> > > "Crail Store implements a hierarchical namespace across a cluster of
> RDMA
> > > interconnected storage resources such as DRAM or flash"
> > >
> > > https://crail.incubator.apache.org/overview/
> > >
> > > + 2 cents
> > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> > >
> > > Cheers
> > >
> > > Nick
> > >
> > > On September 22, 2019 at 9:28 AM Vinoth Chandar 
> > wrote:
> > >
> > >
> > > It could be much larger. :) imagine billions of keys each 32 bytes,
> > mapped
> > > to another 32 byte
> > >
> > > The advantage of the current bloom index is that its effectively
> stored
> > > with data itself and this reduces complexity in terms of keeping index
> > and
> > > data consistent etc
> > >
> > > One orthogonal idea from long time ago that moves indexing out of data
> > > storage and is generalizable
> > >
> > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> > >
> > > If someone here knows flink well and can implement some standalone
> flink
> > > code to mimic tagLocation() functionality and share with the group,
> that
> > > would be great. Lets worry about performance once we have a flink DAG.
> I
> > > think this is a critical and most tricky piece in supporting flink.
> > >
> > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil 
> > > wrote:
> > >
> > > Hi Taher,
> > >
> > > I agree with this , if the state is becoming too large we should have
> an
> > > option of storing it in external state like File System or RocksDb.
> > >
> > > @Vinoth Chandar  can the state of HoodieBloomIndex
> go
> > > beyond 10-15 GB
> > >
> > > Regards,
> > > Vinay Patil
> > >
> > > >
> > >
> > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala 
> > > wrote:
> > >
> > > >> Hey Guys, Any thoughts on the above idea? To handle
> HoodieBloomIndex
> > > with
> > > >> HeapState, RocksDBState and FsState but on Spark.
> > > >>
> > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala 
>
> > > >> wrote:
> > > >>
> > > >> > Hi Vinoth,
> > > >> > Having seen the doc and code. I understand the
> > > >> > HoodieBloomIndex mainly caches key and partition path. Can we
> > address
> > > >> how
> > > >> > Flink does it? Like, have HeapState where the user chooses to
> cache
> > > the
> > > >> > Index on heap, RockDBState where indexes are written to RocksDB
> and
> > > >> finally
> > > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And
> on
> > > top,
> > > >> we
> > > >> > can do an index Time To Live.
> > > >> >
> > > >> > Regards,
> > > >> > Taher Koitawala
> > > >> >
> > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <
> vin...@apache.org>
> > > >> wrote:
> > > >> >
> > > >> >> I still feel the key thing here is reimplementing
> HoodieBloomIndex
> > > >> without
> > > >> >> needing 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread vino yang
Hi Taher,

As I mentioned in the previous mail, things may not be too easy by just using the
Flink state API. Copied here: "Hudi can connect with many different Source/Sinks.
Some file-based reads are not appropriate for Flink Streaming."

Although unifying Batch and Streaming is Flink's goal, it is difficult to ignore
the Flink Batch API to match some features provided by Hudi now.

The example you provided is at the application layer, about usage. So my
suggestion is to be patient; it needs time to produce a detailed design.

Best,
Vino

On 09/24/2019 22:38, Taher Koitawala wrote: Hi All, Sample code to see how
records tagging will be handled in Flink is posted on [1]. The main class to 
run the same is MockHudi.java with a sample path for checkpointing. As of now 
this is just a sample to know we should ke caching in Flink states with bare 
minimum configs. As per my experience I have cached around 10s of TBs in Flink 
rocksDB state with the right configs. So I'm sure it should work here as well. 
1: 
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
 Regards, Taher Koitawala On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar 
 wrote: > It wont be much different than the HBaseIndex we 
have today. Would like to > have always have an option like BloomIndex that 
does not need any external > dependencies. > The moment you bring an external 
data store in, someone becomes a DBA. :) > > On Sun, Sep 22, 2019 at 6:46 AM 
Semantic Beeng  > wrote: > > > @vc can you see how 
ApacheCrail could be used to implement this at scale > > but also in a way that 
abstracts over both Spark and Flink? > > > > "Crail Store implements a 
hierarchical namespace across a cluster of RDMA > > interconnected storage 
resources such as DRAM or flash" > > > > 
https://crail.incubator.apache.org/overview/ > > > > + 2 cents > > 
https://twitter.com/semanticbeeng/status/1175767500790915072?s=20 > > > > 
Cheers > > > > Nick > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar 
 > wrote: > > > > > > It could be much larger. :) imagine 
billions of keys each 32 bytes, > mapped > > to another 32 byte > > > > The 
advantage of the current bloom index is that its effectively stored > > with 
data itself and this reduces complexity in terms of keeping index > and > > 
data consistent etc > > > > One orthogonal idea from long time ago that moves 
indexing out of data > > storage and is generalizable > > > > 
https://github.com/apache/incubator-hudi/wiki/HashMap-Index > > > > If someone 
here knows flink well and can implement some standalone flink > > code to mimic 
tagLocation() functionality and share with the group, that > > would be great. 
Lets worry about performance once we have a flink DAG. I > > think this is a 
critical and most tricky piece in supporting flink. > > > > On Sat, Sep 21, 
2019 at 4:17 AM Vinay Patil  > > wrote: > > > > Hi 
Taher, > > > > I agree with this , if the state is becoming too large we should 
have an > > option of storing it in external state like File System or RocksDb. 
> > > > @Vinoth Chandar  can the state of HoodieBloomIndex 
go > > beyond 10-15 GB > > > > Regards, > > Vinay Patil > > > > > > > > > On 
Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala  > > wrote: > 
> > > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex > 
> with > > >> HeapState, RocksDBState and FsState but on Spark. > > >> > > >> 
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala  > > >> 
wrote: > > >> > > >> > Hi Vinoth, > > >> > Having seen the doc and code. I 
understand the > > >> > HoodieBloomIndex mainly caches key and partition path. 
Can we > address > > >> how > > >> > Flink does it? Like, have HeapState where 
the user chooses to cache > > the > > >> > Index on heap, RockDBState where 
indexes are written to RocksDB and > > >> finally > > >> > FsState where 
indexes can be written to HDFS, S3, Azure Fs. And on > > top, > > >> we > > >> 
> can do an index Time To Live. > > >> > > > >> > Regards, > > >> > Taher 
Koitawala > > >> > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar 
 > > >> wrote: > > >> > > > >> >> I still feel the key thing 
here is reimplementing HoodieBloomIndex > > >> without > > >> >> needing spark 
caching. > > >> >> > > >> >> > > >> > > > 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global
 > > ) > > >> >> documents the spark DAG in detail. > > >> >> > > >> >> If 
everyone feels, it's best for me to scope the work out, then > happy > > >> to 
> > >> >> do > > >> >> it! > > >> >> > > >> >> On Mon, Sep 16, 2019 at 10:23 AM 
Taher Koitawala < > taher...@gmail.com > > > > > >> >> wrote: > > >> >> > > >> 
>> > Guys I think we are slowing down on this again. We need to start > > >> >> 
planning > > >> >> > small small tasks towards this VC please can you help fast 
track > > >> this? > > >> >> > > > >> >> > Regards, > > >> >> > Taher Koitawala 
> > >> >> > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi All,
  Sample code to see how record tagging will be handled in
Flink is posted at [1]. The main class to run the same is MockHudi.java
with a sample path for checkpointing.

As of now this is just a sample to show how we could be caching in Flink
state with bare-minimum configs.
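
For reference, a bare-minimum sketch of the shape I mean; this is NOT the code at
[1], and the class names and the fake file-group assignment are made up:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FlinkTagStateSketch {

  // Keeps a recordKey -> fileGroup mapping in keyed state, so "tagging" an incoming
  // record as insert vs update needs no Spark-style cached RDD.
  static class TagRecords extends KeyedProcessFunction<String, Tuple2<String, String>, String> {
    private transient MapState<String, String> keyToFileGroup;

    @Override
    public void open(Configuration parameters) {
      keyToFileGroup = getRuntimeContext().getMapState(
          new MapStateDescriptor<>("keyToFileGroup", String.class, String.class));
    }

    @Override
    public void processElement(Tuple2<String, String> record, Context ctx, Collector<String> out)
        throws Exception {
      String fileGroup = keyToFileGroup.get(record.f0);
      if (fileGroup == null) {
        fileGroup = "new-file-group";            // insert: pretend we assigned a new file group
        keyToFileGroup.put(record.f0, fileGroup);
      }
      out.collect(record.f0 + " -> " + fileGroup); // update: routed to its existing file group
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-hudi-checkpoints"));
    env.enableCheckpointing(60_000); // bare-minimum checkpoint config

    env.fromElements(Tuple2.of("key-1", "2019/10/01"), Tuple2.of("key-1", "2019/10/02"))
        .keyBy(new KeySelector<Tuple2<String, String>, String>() {
          @Override
          public String getKey(Tuple2<String, String> record) {
            return record.f0;
          }
        })
        .process(new TagRecords())
        .print();

    env.execute("flink-tag-state-sketch");
  }
}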


In my experience, I have cached around 10s of TBs in Flink RocksDB state
with the right configs, so I'm sure it should work here as well.

1:
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi

Regards,
Taher Koitawala


On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar  wrote:

> It won't be much different from the HBaseIndex we have today. Would like to
> always have an option like BloomIndex that does not need any external
> dependencies.
> The moment you bring an external data store in, someone becomes a DBA. :)
>
> On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng 
> wrote:
>
> > @vc can you see how ApacheCrail could be used to implement this at scale
> > but also in a way that abstracts over both Spark and Flink?
> >
> > "Crail Store implements a hierarchical namespace across a cluster of RDMA
> > interconnected storage resources such as DRAM or flash"
> >
> > https://crail.incubator.apache.org/overview/
> >
> > + 2 cents
> > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> >
> > Cheers
> >
> > Nick
> >
> > On September 22, 2019 at 9:28 AM Vinoth Chandar 
> wrote:
> >
> >
> > It could be much larger. :) imagine billions of keys each 32 bytes,
> mapped
> > to another 32 byte
> >
> > The advantage of the current bloom index is that its effectively stored
> > with data itself and this reduces complexity in terms of keeping index
> and
> > data consistent etc
> >
> > One orthogonal idea from long time ago that moves indexing out of data
> > storage and is generalizable
> >
> > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> >
> > If someone here knows flink well and can implement some standalone flink
> > code to mimic tagLocation() functionality and share with the group, that
> > would be great. Lets worry about performance once we have a flink DAG. I
> > think this is a critical and most tricky piece in supporting flink.
> >
> > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil 
> > wrote:
> >
> > Hi Taher,
> >
> > I agree with this , if the state is becoming too large we should have an
> > option of storing it in external state like File System or RocksDb.
> >
> > @Vinoth Chandar  can the state of HoodieBloomIndex go
> > beyond 10-15 GB
> >
> > Regards,
> > Vinay Patil
> >
> > >
> >
> > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala 
> > wrote:
> >
> > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex
> > with
> > >> HeapState, RocksDBState and FsState but on Spark.
> > >>
> > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala 
> > >> wrote:
> > >>
> > >> > Hi Vinoth,
> > >> > Having seen the doc and code. I understand the
> > >> > HoodieBloomIndex mainly caches key and partition path. Can we
> address
> > >> how
> > >> > Flink does it? Like, have HeapState where the user chooses to cache
> > the
> > >> > Index on heap, RockDBState where indexes are written to RocksDB and
> > >> finally
> > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And on
> > top,
> > >> we
> > >> > can do an index Time To Live.
> > >> >
> > >> > Regards,
> > >> > Taher Koitawala
> > >> >
> > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar 
> > >> wrote:
> > >> >
> > >> >> I still feel the key thing here is reimplementing HoodieBloomIndex
> > >> without
> > >> >> needing spark caching.
> > >> >>
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global
> > )
> > >> >> documents the spark DAG in detail.
> > >> >>
> > >> >> If everyone feels, it's best for me to scope the work out, then
> happy
> > >> to
> > >> >> do
> > >> >> it!
> > >> >>
> > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <
> taher...@gmail.com
> > >
> > >> >> wrote:
> > >> >>
> > >> >> > Guys I think we are slowing down on this again. We need to start
> > >> >> planning
> > >> >> > small small tasks towards this VC please can you help fast track
> > >> this?
> > >> >> >
> > >> >> > Regards,
> > >> >> > Taher Koitawala
> > >> >> >
> > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar  >
> > >> >> wrote:
> > >> >> >
> > >> >> > > Look forward to the analysis. A key class to read would be
> > >> >> > > HoodieBloomIndex, which uses a lot of spark caching and
> shuffles.
> > >> >> > >
> > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <
> yanghua1...@gmail.com
> > >
> > >> >> wrote:
> > >> >> > >
> > >> >> > > > >> Currently Spark Streaming micro batching fits well with
> > Hudi,
> > >> >> since
> > >> >> > it
> > >> >> > > > amortizes the cost of indexing, workload profiling etc. 1
> spark
> > >> >> micro
> > >> >> > > batch
> > >> >> > > > = 1 hudi commit
> > >> >> > > > With the 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-21 Thread Vinay Patil
Hi Taher,

I agree with this; if the state is becoming too large, we should have an
option of storing it in an external state backend like a file system or RocksDB.

@Vinoth Chandar   can the state of HoodieBloomIndex go
beyond 10-15 GB?

Regards,
Vinay Patil


On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala  wrote:

> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex with
> HeapState, RocksDBState and FsState but on Spark.
>
> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala 
> wrote:
>
> > Hi Vinoth,
> >Having seen the doc and code. I understand the
> > HoodieBloomIndex mainly caches key and partition path. Can we address how
> > Flink does it? Like, have  HeapState where the user chooses to cache the
> > Index on heap, RockDBState where indexes are written to RocksDB and
> finally
> > FsState where indexes can be written to HDFS, S3, Azure Fs. And on top,
> we
> > can do an index Time To Live.
> >
> > Regards,
> > Taher Koitawala
> >
> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar 
> wrote:
> >
> >> I still feel the key thing here is reimplementing HoodieBloomIndex
> without
> >> needing spark caching.
> >>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
> >>  documents the spark DAG in detail.
> >>
> >> If everyone feels, it's best for me to scope the work out, then happy to
> >> do
> >> it!
> >>
> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> >> wrote:
> >>
> >> > Guys I think we are slowing down on this again. We need to start
> >> planning
> >> > small small tasks towards this VC please can you help fast track this?
> >> >
> >> > Regards,
> >> > Taher Koitawala
> >> >
> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
> >> wrote:
> >> >
> >> > > Look forward to the analysis. A key class to read would be
> >> > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> >> > >
> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> >> wrote:
> >> > >
> >> > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> >> since
> >> > it
> >> > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> >> micro
> >> > > batch
> >> > > > = 1 hudi commit
> >> > > > With the per-record model in Flink, I am not sure how useful it
> >> will be
> >> > > to
> >> > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> >> will
> >> > > be
> >> > > > inefficient..
> >> > > >
> >> > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> >> About
> >> > > > Flink streaming, we can also implement the "batch" and
> "micro-batch"
> >> > > model
> >> > > > when process data. For example:
> >> > > >
> >> > > >- aggregation: use flexibility window mechanism;
> >> > > >- non-aggregation: use Flink stateful state API cache a batch
> >> data
> >> > > >
> >> > > >
> >> > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> >> full
> >> > > > summary of how Spark is being used in a wiki page is a good start
> >> IMO.
> >> > We
> >> > > > can then hash out what can be generalized and what cannot be and
> >> needs
> >> > to
> >> > > > be left in hudi-client-spark vs hudi-client-core
> >> > > >
> >> > > > agree
> >> > > >
> >> > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> >> > > >
> >> > > > > >> We should only stick to Flink Streaming. Furthermore if there
> >> is a
> >> > > > > requirement for batch then users
> >> > > > > >> should use Spark or then we will anyway have a beam
> integration
> >> > > coming
> >> > > > > up.
> >> > > > >
> >> > > > > Currently Spark Streaming micro batching fits well with Hudi,
> >> since
> >> > it
> >> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> >> micro
> >> > > > batch
> >> > > > > = 1 hudi commit
> >> > > > > With the per-record model in Flink, I am not sure how useful it
> >> will
> >> > be
> >> > > > to
> >> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> it
> >> > will
> >> > > > be
> >> > > > > inefficient..
> >> > > > >
> >> > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> >> full
> >> > > > > summary of how Spark is being used in a wiki page is a good
> start
> >> > IMO.
> >> > > We
> >> > > > > can then hash out what can be generalized and what cannot be and
> >> > needs
> >> > > to
> >> > > > > be left in hudi-client-spark vs hudi-client-core
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <
> yanghua1...@gmail.com>
> >> > > wrote:
> >> > > > >
> >> > > > > > Hi Nick and Taher,
> >> > > > > >
> >> > > > > > I just want to answer Nishith's question. Reference his old
> >> > > description
> >> > > > > > here:
> >> > > > > >
> >> > > > > > > You can do a parallel investigation while we are deciding on
> >> the
> >> > > > module
> >> > > > > > structure.  You could be looking at all the patterns in Hudi's
> >> > Spark
> >> > > > APIs
> >> > > > > > usage 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-20 Thread Taher Koitawala
Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex with
HeapState, RocksDBState and FsState but on Spark.

On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala  wrote:

> Hi Vinoth,
>Having seen the doc and code. I understand the
> HoodieBloomIndex mainly caches key and partition path. Can we address how
> Flink does it? Like, have  HeapState where the user chooses to cache the
> Index on heap, RockDBState where indexes are written to RocksDB and finally
> FsState where indexes can be written to HDFS, S3, Azure Fs. And on top, we
> can do an index Time To Live.
>
> Regards,
> Taher Koitawala
>
> On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar  wrote:
>
>> I still feel the key thing here is reimplementing HoodieBloomIndex without
>> needing spark caching.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
>>  documents the spark DAG in detail.
>>
>> If everyone feels, it's best for me to scope the work out, then happy to
>> do
>> it!
>>
>> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
>> wrote:
>>
>> > Guys I think we are slowing down on this again. We need to start
>> planning
>> > small small tasks towards this VC please can you help fast track this?
>> >
>> > Regards,
>> > Taher Koitawala
>> >
>> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
>> wrote:
>> >
>> > > Look forward to the analysis. A key class to read would be
>> > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
>> > >
>> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
>> wrote:
>> > >
>> > > > >> Currently Spark Streaming micro batching fits well with Hudi,
>> since
>> > it
>> > > > amortizes the cost of indexing, workload profiling etc. 1 spark
>> micro
>> > > batch
>> > > > = 1 hudi commit
>> > > > With the per-record model in Flink, I am not sure how useful it
>> will be
>> > > to
>> > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
>> will
>> > > be
>> > > > inefficient..
>> > > >
>> > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
>> About
>> > > > Flink streaming, we can also implement the "batch" and "micro-batch"
>> > > model
>> > > > when process data. For example:
>> > > >
>> > > >- aggregation: use flexibility window mechanism;
>> > > >- non-aggregation: use Flink stateful state API cache a batch
>> data
>> > > >
>> > > >
>> > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
>> full
>> > > > summary of how Spark is being used in a wiki page is a good start
>> IMO.
>> > We
>> > > > can then hash out what can be generalized and what cannot be and
>> needs
>> > to
>> > > > be left in hudi-client-spark vs hudi-client-core
>> > > >
>> > > > agree
>> > > >
>> > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
>> > > >
>> > > > > >> We should only stick to Flink Streaming. Furthermore if there
>> is a
>> > > > > requirement for batch then users
>> > > > > >> should use Spark or then we will anyway have a beam integration
>> > > coming
>> > > > > up.
>> > > > >
>> > > > > Currently Spark Streaming micro batching fits well with Hudi,
>> since
>> > it
>> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
>> micro
>> > > > batch
>> > > > > = 1 hudi commit
>> > > > > With the per-record model in Flink, I am not sure how useful it
>> will
>> > be
>> > > > to
>> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
>> > will
>> > > > be
>> > > > > inefficient..
>> > > > >
>> > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
>> full
>> > > > > summary of how Spark is being used in a wiki page is a good start
>> > IMO.
>> > > We
>> > > > > can then hash out what can be generalized and what cannot be and
>> > needs
>> > > to
>> > > > > be left in hudi-client-spark vs hudi-client-core
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang 
>> > > wrote:
>> > > > >
>> > > > > > Hi Nick and Taher,
>> > > > > >
>> > > > > > I just want to answer Nishith's question. Reference his old
>> > > description
>> > > > > > here:
>> > > > > >
>> > > > > > > You can do a parallel investigation while we are deciding on
>> the
>> > > > module
>> > > > > > structure.  You could be looking at all the patterns in Hudi's
>> > Spark
>> > > > APIs
>> > > > > > usage (RDD/DataSource/SparkContext) and see if such support can
>> be
>> > > > > achieved
>> > > > > > in theory with Flink. If not, what is the workaround.
>> Documenting
>> > > such
>> > > > > > patterns would be valuable when multiple engineers are working
>> on
>> > it.
>> > > > For
>> > > > > > e:g, Hudi relies on (a) custom partitioning logic for
>> upserts,
>> > > > >  (b)
>> > > > > > caching RDDs to avoid reruns of costly stages (c) A Spark
>> > upsert
>> > > > task
>> > > > > > knowing its spark partition/task/attempt ids
>> > > > > >
>> > > > > > And just like the title of this thread, we are going to try to
>> > > decouple
>> > > 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-17 Thread Taher Koitawala
Hi Vinoth,
   Having seen the doc and code, I understand that
HoodieBloomIndex mainly caches the record key and partition path. Can we address
how Flink does it? Like, have HeapState where the user chooses to cache the
index on heap, RocksDBState where indexes are written to RocksDB, and finally
FsState where indexes can be written to HDFS, S3, or Azure FS. And on top, we
can do an index Time To Live (a minimal sketch of the TTL piece follows below).
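
The TTL part at least already has a standard shape in Flink via StateTtlConfig;
a minimal sketch (purely illustrative, descriptor names made up, and in a real job
this would be applied inside an operator's open()):

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

public class IndexTtlSketch {
  public static void main(String[] args) {
    // Expire index entries a week after their last read/write, never return expired ones.
    StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(7))
        .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

    MapStateDescriptor<String, String> keyToPartitionPath =
        new MapStateDescriptor<>("keyToPartitionPath", String.class, String.class);
    keyToPartitionPath.enableTimeToLive(ttl);

    System.out.println("TTL enabled on descriptor: " + keyToPartitionPath.getName());
  }
}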

Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar  wrote:

> I still feel the key thing here is reimplementing HoodieBloomIndex without
> needing spark caching.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
>  documents the spark DAG in detail.
>
> If everyone feels, it's best for me to scope the work out, then happy to do
> it!
>
> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> wrote:
>
> > Guys I think we are slowing down on this again. We need to start planning
> > small small tasks towards this VC please can you help fast track this?
> >
> > Regards,
> > Taher Koitawala
> >
> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar  wrote:
> >
> > > Look forward to the analysis. A key class to read would be
> > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > >
> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> wrote:
> > >
> > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> since
> > it
> > > > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> > > batch
> > > > = 1 hudi commit
> > > > With the per-record model in Flink, I am not sure how useful it will
> be
> > > to
> > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> will
> > > be
> > > > inefficient..
> > > >
> > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> About
> > > > Flink streaming, we can also implement the "batch" and "micro-batch"
> > > model
> > > > when process data. For example:
> > > >
> > > >- aggregation: use flexibility window mechanism;
> > > >- non-aggregation: use Flink stateful state API cache a batch data
> > > >
> > > >
> > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> full
> > > > summary of how Spark is being used in a wiki page is a good start
> IMO.
> > We
> > > > can then hash out what can be generalized and what cannot be and
> needs
> > to
> > > > be left in hudi-client-spark vs hudi-client-core
> > > >
> > > > agree
> > > >
> > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > >
> > > > > >> We should only stick to Flink Streaming. Furthermore if there
> is a
> > > > > requirement for batch then users
> > > > > >> should use Spark or then we will anyway have a beam integration
> > > coming
> > > > > up.
> > > > >
> > > > > Currently Spark Streaming micro batching fits well with Hudi, since
> > it
> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> micro
> > > > batch
> > > > > = 1 hudi commit
> > > > > With the per-record model in Flink, I am not sure how useful it
> will
> > be
> > > > to
> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> > will
> > > > be
> > > > > inefficient..
> > > > >
> > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> full
> > > > > summary of how Spark is being used in a wiki page is a good start
> > IMO.
> > > We
> > > > > can then hash out what can be generalized and what cannot be and
> > needs
> > > to
> > > > > be left in hudi-client-spark vs hudi-client-core
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang 
> > > wrote:
> > > > >
> > > > > > Hi Nick and Taher,
> > > > > >
> > > > > > I just want to answer Nishith's question. Reference his old
> > > description
> > > > > > here:
> > > > > >
> > > > > > > You can do a parallel investigation while we are deciding on
> the
> > > > module
> > > > > > structure.  You could be looking at all the patterns in Hudi's
> > Spark
> > > > APIs
> > > > > > usage (RDD/DataSource/SparkContext) and see if such support can
> be
> > > > > achieved
> > > > > > in theory with Flink. If not, what is the workaround. Documenting
> > > such
> > > > > > patterns would be valuable when multiple engineers are working on
> > it.
> > > > For
> > > > > > e:g, Hudi relies on (a) custom partitioning logic for
> upserts,
> > > > >  (b)
> > > > > > caching RDDs to avoid reruns of costly stages (c) A Spark
> > upsert
> > > > task
> > > > > > knowing its spark partition/task/attempt ids
> > > > > >
> > > > > > And just like the title of this thread, we are going to try to
> > > decouple
> > > > > > Hudi and Spark. That means we can run the whole Hudi without
> > > depending
> > > > > > Spark. So we need to analyze all the usage of Spark in Hudi.
> > > > > >
> > > > > > Here we are not discussing the integration of Hudi and Flink in
> the
> > > > > > application layer. Instead, I want Hudi to be decoupled from
> Spark
> > > and
> > > > > > allow 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
Alright then. Happy to take the lead here. But please give me a week or so
to finish up the Spark bundling and other jar issues.. Too much context
switching :)



On Mon, Sep 16, 2019 at 6:57 PM vino yang  wrote:

> Hi guys,
>
> Currently, I am busy with HUDI-203[1] and other things.
>
> I agree with Vinoth that we should try to find a new solution to decouple
> the dependency with the Spark RDD cache.
>
> It's an excellent way to start this big work.
>
> [1]: https://issues.apache.org/jira/browse/HUDI-203
>
> vbal...@apache.org  wrote on Tuesday, September 17, 2019 at 3:49 AM:
>
> >
> > +1 This is a pretty large undertaking. While the community is getting
> > their hands dirty and ramping up on Hudi internals, it would be
> productive
> > if Vinoth shepherds this
> > Balaji.V
> >
> > On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth
> Chandar
> >  wrote:
> >
> >  sg. :)
> >
> > I will wait for others on this thread as well to chime in.
> >
> > On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala 
> > wrote:
> >
> > > Vinoth, I think right now given your experience with the project you
> > should
> > > be scoping out what needs to be done to take us there. So +1 for giving
> > you
> > > more work :)
> > >
> > > We want to reach a point where we can start scoping out addition of
> Flink
> > > and Beam components within. Then I think will tremendous progress.
> > >
> > > On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar 
> wrote:
> > >
> > > > I still feel the key thing here is reimplementing HoodieBloomIndex
> > > without
> > > > needing spark caching.
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
> > > >  documents the spark DAG in detail.
> > > >
> > > > If everyone feels, it's best for me to scope the work out, then happy
> > to
> > > do
> > > > it!
> > > >
> > > > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala  >
> > > > wrote:
> > > >
> > > > > Guys I think we are slowing down on this again. We need to start
> > > planning
> > > > > small small tasks towards this VC please can you help fast track
> > this?
> > > > >
> > > > > Regards,
> > > > > Taher Koitawala
> > > > >
> > > > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
> > > wrote:
> > > > >
> > > > > > Look forward to the analysis. A key class to read would be
> > > > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang  >
> > > > wrote:
> > > > > >
> > > > > > > >> Currently Spark Streaming micro batching fits well with
> Hudi,
> > > > since
> > > > > it
> > > > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> > > micro
> > > > > > batch
> > > > > > > = 1 hudi commit
> > > > > > > With the per-record model in Flink, I am not sure how useful it
> > > will
> > > > be
> > > > > > to
> > > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> > it
> > > > will
> > > > > > be
> > > > > > > inefficient..
> > > > > > >
> > > > > > > Yes, if 1 input record = 1 hudi commit, it would be
> inefficient.
> > > > About
> > > > > > > Flink streaming, we can also implement the "batch" and
> > > "micro-batch"
> > > > > > model
> > > > > > > when process data. For example:
> > > > > > >
> > > > > > >- aggregation: use flexibility window mechanism;
> > > > > > >- non-aggregation: use Flink stateful state API cache a
> batch
> > > data
> > > > > > >
> > > > > > >
> > > > > > > >> On first focussing on decoupling of Spark and Hudi alone,
> yes
> > a
> > > > full
> > > > > > > summary of how Spark is being used in a wiki page is a good
> start
> > > > IMO.
> > > > > We
> > > > > > > can then hash out what can be generalized and what cannot be
> and
> > > > needs
> > > > > to
> > > > > > > be left in hudi-client-spark vs hudi-client-core
> > > > > > >
> > > > > > > agree
> > > > > > >
> > > > > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > > > > >
> > > > > > > > >> We should only stick to Flink Streaming. Furthermore if
> > there
> > > > is a
> > > > > > > > requirement for batch then users
> > > > > > > > >> should use Spark or then we will anyway have a beam
> > > integration
> > > > > > coming
> > > > > > > > up.
> > > > > > > >
> > > > > > > > Currently Spark Streaming micro batching fits well with Hudi,
> > > since
> > > > > it
> > > > > > > > amortizes the cost of indexing, workload profiling etc. 1
> spark
> > > > micro
> > > > > > > batch
> > > > > > > > = 1 hudi commit
> > > > > > > > With the per-record model in Flink, I am not sure how useful
> it
> > > > will
> > > > > be
> > > > > > > to
> > > > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi
> commit,
> > > it
> > > > > will
> > > > > > > be
> > > > > > > > inefficient..
> > > > > > > >
> > > > > > > > On first focussing on decoupling of Spark and Hudi alone,
> yes a
> > > > full
> > > > > > > > summary of how Spark is being used in a wiki page is a good
> > start
> > > > > IMO.
> > > > > > We
> > > > > 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vino yang
Hi guys,

Currently, I am busy with HUDI-203[1] and other things.

I agree with Vinoth that we should try to find a new solution to decouple
the dependency on the Spark RDD cache.

It's an excellent way to start this big work.

[1]: https://issues.apache.org/jira/browse/HUDI-203

vbal...@apache.org  wrote on Tuesday, September 17, 2019 at 3:49 AM:

>
> +1 This is a pretty large undertaking. While the community is getting
> their hands dirty and ramping up on Hudi internals, it would be productive
> if Vinoth shepherds this
> Balaji.V
>
> On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar
>  wrote:
>
>  sg. :)
>
> I will wait for others on this thread as well to chime in.
>
> On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala 
> wrote:
>
> > Vinoth, I think right now given your experience with the project you
> should
> > be scoping out what needs to be done to take us there. So +1 for giving
> you
> > more work :)
> >
> > We want to reach a point where we can start scoping out addition of Flink
> > and Beam components within. Then I think will tremendous progress.
> >
> > On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar  wrote:
> >
> > > I still feel the key thing here is reimplementing HoodieBloomIndex
> > without
> > > needing spark caching.
> > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
> > >  documents the spark DAG in detail.
> > >
> > > If everyone feels, it's best for me to scope the work out, then happy
> to
> > do
> > > it!
> > >
> > > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> > > wrote:
> > >
> > > > Guys I think we are slowing down on this again. We need to start
> > planning
> > > > small small tasks towards this VC please can you help fast track
> this?
> > > >
> > > > Regards,
> > > > Taher Koitawala
> > > >
> > > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
> > wrote:
> > > >
> > > > > Look forward to the analysis. A key class to read would be
> > > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > > >
> > > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> > > wrote:
> > > > >
> > > > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> > > since
> > > > it
> > > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> > micro
> > > > > batch
> > > > > > = 1 hudi commit
> > > > > > With the per-record model in Flink, I am not sure how useful it
> > will
> > > be
> > > > > to
> > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> it
> > > will
> > > > > be
> > > > > > inefficient..
> > > > > >
> > > > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> > > About
> > > > > > Flink streaming, we can also implement the "batch" and
> > "micro-batch"
> > > > > model
> > > > > > when process data. For example:
> > > > > >
> > > > > >- aggregation: use flexibility window mechanism;
> > > > > >- non-aggregation: use Flink stateful state API cache a batch
> > data
> > > > > >
> > > > > >
> > > > > > >> On first focussing on decoupling of Spark and Hudi alone, yes
> a
> > > full
> > > > > > summary of how Spark is being used in a wiki page is a good start
> > > IMO.
> > > > We
> > > > > > can then hash out what can be generalized and what cannot be and
> > > needs
> > > > to
> > > > > > be left in hudi-client-spark vs hudi-client-core
> > > > > >
> > > > > > agree
> > > > > >
> > > > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > > > >
> > > > > > > >> We should only stick to Flink Streaming. Furthermore if
> there
> > > is a
> > > > > > > requirement for batch then users
> > > > > > > >> should use Spark or then we will anyway have a beam
> > integration
> > > > > coming
> > > > > > > up.
> > > > > > >
> > > > > > > Currently Spark Streaming micro batching fits well with Hudi,
> > since
> > > > it
> > > > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> > > micro
> > > > > > batch
> > > > > > > = 1 hudi commit
> > > > > > > With the per-record model in Flink, I am not sure how useful it
> > > will
> > > > be
> > > > > > to
> > > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> > it
> > > > will
> > > > > > be
> > > > > > > inefficient..
> > > > > > >
> > > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> > > full
> > > > > > > summary of how Spark is being used in a wiki page is a good
> start
> > > > IMO.
> > > > > We
> > > > > > > can then hash out what can be generalized and what cannot be
> and
> > > > needs
> > > > > to
> > > > > > > be left in hudi-client-spark vs hudi-client-core
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang <
> yanghua1...@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Nick and Taher,
> > > > > > > >
> > > > > > > > I just want to answer Nishith's question. Reference his old
> > > > > description
> > > > > > > > here:
> > > > > > > >
> > > > > > > > > You can do a 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vbal...@apache.org
 
+1 This is a pretty large undertaking. While the community is getting their 
hands dirty and ramping up on Hudi internals, it would be productive if Vinoth 
shepherds this
Balaji.V
On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar 
 wrote:  
 
 sg. :)

I will wait for others on this thread as well to chime in.

On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala  wrote:

> Vinoth, I think right now given your experience with the project you should
> be scoping out what needs to be done to take us there. So +1 for giving you
> more work :)
>
> We want to reach a point where we can start scoping out addition of Flink
> and Beam components within. Then I think will tremendous progress.
>
> On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar  wrote:
>
> > I still feel the key thing here is reimplementing HoodieBloomIndex
> without
> > needing spark caching.
> >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
> >  documents the spark DAG in detail.
> >
> > If everyone feels, it's best for me to scope the work out, then happy to
> do
> > it!
> >
> > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> > wrote:
> >
> > > Guys I think we are slowing down on this again. We need to start
> planning
> > > small small tasks towards this VC please can you help fast track this?
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
> wrote:
> > >
> > > > Look forward to the analysis. A key class to read would be
> > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > >
> > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> > wrote:
> > > >
> > > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> > since
> > > it
> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> micro
> > > > batch
> > > > > = 1 hudi commit
> > > > > With the per-record model in Flink, I am not sure how useful it
> will
> > be
> > > > to
> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> > will
> > > > be
> > > > > inefficient..
> > > > >
> > > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> > About
> > > > > Flink streaming, we can also implement the "batch" and
> "micro-batch"
> > > > model
> > > > > when process data. For example:
> > > > >
> > > > >    - aggregation: use flexibility window mechanism;
> > > > >    - non-aggregation: use Flink stateful state API cache a batch
> data
> > > > >
> > > > >
> > > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> > full
> > > > > summary of how Spark is being used in a wiki page is a good start
> > IMO.
> > > We
> > > > > can then hash out what can be generalized and what cannot be and
> > needs
> > > to
> > > > > be left in hudi-client-spark vs hudi-client-core
> > > > >
> > > > > agree
> > > > >
> > > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > > >
> > > > > > >> We should only stick to Flink Streaming. Furthermore if there
> > is a
> > > > > > requirement for batch then users
> > > > > > >> should use Spark or then we will anyway have a beam
> integration
> > > > coming
> > > > > > up.
> > > > > >
> > > > > > Currently Spark Streaming micro batching fits well with Hudi,
> since
> > > it
> > > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> > micro
> > > > > batch
> > > > > > = 1 hudi commit
> > > > > > With the per-record model in Flink, I am not sure how useful it
> > will
> > > be
> > > > > to
> > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> it
> > > will
> > > > > be
> > > > > > inefficient..
> > > > > >
> > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> > full
> > > > > > summary of how Spark is being used in a wiki page is a good start
> > > IMO.
> > > > We
> > > > > > can then hash out what can be generalized and what cannot be and
> > > needs
> > > > to
> > > > > > be left in hudi-client-spark vs hudi-client-core
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang  >
> > > > wrote:
> > > > > >
> > > > > > > Hi Nick and Taher,
> > > > > > >
> > > > > > > I just want to answer Nishith's question. Reference his old
> > > > description
> > > > > > > here:
> > > > > > >
> > > > > > > > You can do a parallel investigation while we are deciding on
> > the
> > > > > module
> > > > > > > structure.  You could be looking at all the patterns in Hudi's
> > > Spark
> > > > > APIs
> > > > > > > usage (RDD/DataSource/SparkContext) and see if such support can
> > be
> > > > > > achieved
> > > > > > > in theory with Flink. If not, what is the workaround.
> Documenting
> > > > such
> > > > > > > patterns would be valuable when multiple engineers are working
> on
> > > it.
> > > > > For
> > > > > > > e:g, Hudi relies on    (a) custom partitioning logic for
> > upserts,
> > > > > >  (b)
> > > > > > > caching RDDs to avoid reruns of costly stages    

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
sg. :)

I will wait for others on this thread as well to chime in.

On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala  wrote:

> Vinoth, I think right now given your experience with the project you should
> be scoping out what needs to be done to take us there. So +1 for giving you
> more work :)
>
> We want to reach a point where we can start scoping out addition of Flink
> and Beam components within. Then I think will tremendous progress.
>
> On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar  wrote:
>
> > I still feel the key thing here is reimplementing HoodieBloomIndex
> without
> > needing spark caching.
> >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
> >  documents the spark DAG in detail.
> >
> > If everyone feels, it's best for me to scope the work out, then happy to
> do
> > it!
> >
> > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> > wrote:
> >
> > > Guys I think we are slowing down on this again. We need to start
> planning
> > > small small tasks towards this VC please can you help fast track this?
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
> wrote:
> > >
> > > > Look forward to the analysis. A key class to read would be
> > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > >
> > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> > wrote:
> > > >
> > > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> > since
> > > it
> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> micro
> > > > batch
> > > > > = 1 hudi commit
> > > > > With the per-record model in Flink, I am not sure how useful it
> will
> > be
> > > > to
> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> > will
> > > > be
> > > > > inefficient..
> > > > >
> > > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> > About
> > > > > Flink streaming, we can also implement the "batch" and
> "micro-batch"
> > > > model
> > > > > when process data. For example:
> > > > >
> > > > >- aggregation: use flexibility window mechanism;
> > > > >- non-aggregation: use Flink stateful state API cache a batch
> data
> > > > >
> > > > >
> > > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> > full
> > > > > summary of how Spark is being used in a wiki page is a good start
> > IMO.
> > > We
> > > > > can then hash out what can be generalized and what cannot be and
> > needs
> > > to
> > > > > be left in hudi-client-spark vs hudi-client-core
> > > > >
> > > > > agree
> > > > >
> > > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > > >
> > > > > > >> We should only stick to Flink Streaming. Furthermore if there
> > is a
> > > > > > requirement for batch then users
> > > > > > >> should use Spark or then we will anyway have a beam
> integration
> > > > coming
> > > > > > up.
> > > > > >
> > > > > > Currently Spark Streaming micro batching fits well with Hudi,
> since
> > > it
> > > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> > micro
> > > > > batch
> > > > > > = 1 hudi commit
> > > > > > With the per-record model in Flink, I am not sure how useful it
> > will
> > > be
> > > > > to
> > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> it
> > > will
> > > > > be
> > > > > > inefficient..
> > > > > >
> > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> > full
> > > > > > summary of how Spark is being used in a wiki page is a good start
> > > IMO.
> > > > We
> > > > > > can then hash out what can be generalized and what cannot be and
> > > needs
> > > > to
> > > > > > be left in hudi-client-spark vs hudi-client-core
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang  >
> > > > wrote:
> > > > > >
> > > > > > > Hi Nick and Taher,
> > > > > > >
> > > > > > > I just want to answer Nishith's question. Reference his old
> > > > description
> > > > > > > here:
> > > > > > >
> > > > > > > > You can do a parallel investigation while we are deciding on
> > the
> > > > > module
> > > > > > > structure.  You could be looking at all the patterns in Hudi's
> > > Spark
> > > > > APIs
> > > > > > > usage (RDD/DataSource/SparkContext) and see if such support can
> > be
> > > > > > achieved
> > > > > > > in theory with Flink. If not, what is the workaround.
> Documenting
> > > > such
> > > > > > > patterns would be valuable when multiple engineers are working
> on
> > > it.
> > > > > For
> > > > > > > e:g, Hudi relies on (a) custom partitioning logic for
> > upserts,
> > > > > >  (b)
> > > > > > > caching RDDs to avoid reruns of costly stages (c) A Spark
> > > upsert
> > > > > task
> > > > > > > knowing its spark partition/task/attempt ids
> > > > > > >
> > > > > > > And just like the title of this thread, we are going to try to
> > > > decouple
> > > > > > > Hudi and Spark. That means we can run the 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Taher Koitawala
Vinoth, I think right now given your experience with the project you should
be scoping out what needs to be done to take us there. So +1 for giving you
more work :)

We want to reach a point where we can start scoping out the addition of Flink
and Beam components. Then I think that will be tremendous progress.

On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar  wrote:

> I still feel the key thing here is reimplementing HoodieBloomIndex without
> needing spark caching.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
>  documents the spark DAG in detail.
>
> If everyone feels, it's best for me to scope the work out, then happy to do
> it!
>
> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> wrote:
>
> > Guys I think we are slowing down on this again. We need to start planning
> > small small tasks towards this VC please can you help fast track this?
> >
> > Regards,
> > Taher Koitawala
> >
> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar  wrote:
> >
> > > Look forward to the analysis. A key class to read would be
> > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > >
> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> wrote:
> > >
> > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> since
> > it
> > > > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> > > batch
> > > > = 1 hudi commit
> > > > With the per-record model in Flink, I am not sure how useful it will
> be
> > > to
> > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> will
> > > be
> > > > inefficient..
> > > >
> > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> About
> > > > Flink streaming, we can also implement the "batch" and "micro-batch"
> > > model
> > > > when process data. For example:
> > > >
> > > >- aggregation: use flexibility window mechanism;
> > > >- non-aggregation: use Flink stateful state API cache a batch data
> > > >
> > > >
> > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> full
> > > > summary of how Spark is being used in a wiki page is a good start
> IMO.
> > We
> > > > can then hash out what can be generalized and what cannot be and
> needs
> > to
> > > > be left in hudi-client-spark vs hudi-client-core
> > > >
> > > > agree
> > > >
> > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > >
> > > > > >> We should only stick to Flink Streaming. Furthermore if there
> is a
> > > > > requirement for batch then users
> > > > > >> should use Spark or then we will anyway have a beam integration
> > > coming
> > > > > up.
> > > > >
> > > > > Currently Spark Streaming micro batching fits well with Hudi, since
> > it
> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> micro
> > > > batch
> > > > > = 1 hudi commit
> > > > > With the per-record model in Flink, I am not sure how useful it
> will
> > be
> > > > to
> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> > will
> > > > be
> > > > > inefficient..
> > > > >
> > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> full
> > > > > summary of how Spark is being used in a wiki page is a good start
> > IMO.
> > > We
> > > > > can then hash out what can be generalized and what cannot be and
> > needs
> > > to
> > > > > be left in hudi-client-spark vs hudi-client-core
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang 
> > > wrote:
> > > > >
> > > > > > Hi Nick and Taher,
> > > > > >
> > > > > > I just want to answer Nishith's question. Reference his old
> > > description
> > > > > > here:
> > > > > >
> > > > > > > You can do a parallel investigation while we are deciding on
> the
> > > > module
> > > > > > structure.  You could be looking at all the patterns in Hudi's
> > Spark
> > > > APIs
> > > > > > usage (RDD/DataSource/SparkContext) and see if such support can
> be
> > > > > achieved
> > > > > > in theory with Flink. If not, what is the workaround. Documenting
> > > such
> > > > > > patterns would be valuable when multiple engineers are working on
> > it.
> > > > For
> > > > > > e:g, Hudi relies on (a) custom partitioning logic for
> upserts,
> > > > >  (b)
> > > > > > caching RDDs to avoid reruns of costly stages (c) A Spark
> > upsert
> > > > task
> > > > > > knowing its spark partition/task/attempt ids
> > > > > >
> > > > > > And just like the title of this thread, we are going to try to
> > > decouple
> > > > > > Hudi and Spark. That means we can run the whole Hudi without
> > > depending
> > > > > > Spark. So we need to analyze all the usage of Spark in Hudi.
> > > > > >
> > > > > > Here we are not discussing the integration of Hudi and Flink in
> the
> > > > > > application layer. Instead, I want Hudi to be decoupled from
> Spark
> > > and
> > > > > > allow other engines (such as Flink) to replace Spark.
> > > > > >
> > > > > > It can be divided into long-term goals and 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
I still feel the key thing here is reimplementing HoodieBloomIndex without
needing spark caching.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
 documents the spark DAG in detail.

If everyone feels it's best for me to scope the work out, then I'm happy to do
it!

On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala  wrote:

> Guys I think we are slowing down on this again. We need to start planning
> small small tasks towards this VC please can you help fast track this?
>
> Regards,
> Taher Koitawala
>
> On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar  wrote:
>
> > Look forward to the analysis. A key class to read would be
> > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> >
> > On Tue, Aug 13, 2019 at 7:52 PM vino yang  wrote:
> >
> > > >> Currently Spark Streaming micro batching fits well with Hudi, since
> it
> > > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> > batch
> > > = 1 hudi commit
> > > With the per-record model in Flink, I am not sure how useful it will be
> > to
> > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will
> > be
> > > inefficient..
> > >
> > > Yes, if 1 input record = 1 hudi commit, it would be inefficient. About
> > > Flink streaming, we can also implement the "batch" and "micro-batch"
> > model
> > > when process data. For example:
> > >
> > >- aggregation: use flexibility window mechanism;
> > >- non-aggregation: use Flink stateful state API cache a batch data
> > >
> > >
> > > >> On first focussing on decoupling of Spark and Hudi alone, yes a full
> > > summary of how Spark is being used in a wiki page is a good start IMO.
> We
> > > can then hash out what can be generalized and what cannot be and needs
> to
> > > be left in hudi-client-spark vs hudi-client-core
> > >
> > > agree
> > >
> > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > >
> > > > >> We should only stick to Flink Streaming. Furthermore if there is a
> > > > requirement for batch then users
> > > > >> should use Spark or then we will anyway have a beam integration
> > coming
> > > > up.
> > > >
> > > > Currently Spark Streaming micro batching fits well with Hudi, since
> it
> > > > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> > > batch
> > > > = 1 hudi commit
> > > > With the per-record model in Flink, I am not sure how useful it will
> be
> > > to
> > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> will
> > > be
> > > > inefficient..
> > > >
> > > > On first focussing on decoupling of Spark and Hudi alone, yes a full
> > > > summary of how Spark is being used in a wiki page is a good start
> IMO.
> > We
> > > > can then hash out what can be generalized and what cannot be and
> needs
> > to
> > > > be left in hudi-client-spark vs hudi-client-core
> > > >
> > > >
> > > >
> > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang 
> > wrote:
> > > >
> > > > > Hi Nick and Taher,
> > > > >
> > > > > I just want to answer Nishith's question. Reference his old
> > description
> > > > > here:
> > > > >
> > > > > > You can do a parallel investigation while we are deciding on the
> > > module
> > > > > structure.  You could be looking at all the patterns in Hudi's
> Spark
> > > APIs
> > > > > usage (RDD/DataSource/SparkContext) and see if such support can be
> > > > achieved
> > > > > in theory with Flink. If not, what is the workaround. Documenting
> > such
> > > > > patterns would be valuable when multiple engineers are working on
> it.
> > > For
> > > > > e:g, Hudi relies on (a) custom partitioning logic for upserts,
> > > >  (b)
> > > > > caching RDDs to avoid reruns of costly stages (c) A Spark
> upsert
> > > task
> > > > > knowing its spark partition/task/attempt ids
> > > > >
> > > > > And just like the title of this thread, we are going to try to
> > decouple
> > > > > Hudi and Spark. That means we can run the whole Hudi without
> > depending
> > > > > Spark. So we need to analyze all the usage of Spark in Hudi.
> > > > >
> > > > > Here we are not discussing the integration of Hudi and Flink in the
> > > > > application layer. Instead, I want Hudi to be decoupled from Spark
> > and
> > > > > allow other engines (such as Flink) to replace Spark.
> > > > >
> > > > > It can be divided into long-term goals and short-term goals. As
> > Nishith
> > > > > stated in a recent email.
> > > > >
> > > > > I mentioned the Flink Batch API here because Hudi can connect with
> > many
> > > > > different Source/Sinks. Some file-based reads are not appropriate
> for
> > > > Flink
> > > > > Streaming.
> > > > >
> > > > > Therefore, this is a comprehensive survey of the use of Spark in
> > Hudi.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > >
> > > > > taher koitawala  于2019年8月13日周二 下午5:43写道:
> > > > >
> > > > > > Hi Vino,
> > > > > >   According to what I've seen Hudi has a lot of spark
> component
> > > > > flowing
> > > > > > throwing it. Like 

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Taher Koitawala
Guys, I think we are slowing down on this again. We need to start planning
small tasks towards this. VC, can you please help fast-track this?

Regards,
Taher Koitawala

On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar  wrote:

> Look forward to the analysis. A key class to read would be
> HoodieBloomIndex, which uses a lot of spark caching and shuffles.
>
> On Tue, Aug 13, 2019 at 7:52 PM vino yang  wrote:
>
> > >> Currently Spark Streaming micro batching fits well with Hudi, since it
> > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> batch
> > = 1 hudi commit
> > With the per-record model in Flink, I am not sure how useful it will be
> to
> > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will
> be
> > inefficient..
> >
> > Yes, if 1 input record = 1 hudi commit, it would be inefficient. About
> > Flink streaming, we can also implement the "batch" and "micro-batch"
> model
> > when process data. For example:
> >
> >- aggregation: use flexibility window mechanism;
> >- non-aggregation: use Flink stateful state API cache a batch data
> >
> >
> > >> On first focussing on decoupling of Spark and Hudi alone, yes a full
> > summary of how Spark is being used in a wiki page is a good start IMO. We
> > can then hash out what can be generalized and what cannot be and needs to
> > be left in hudi-client-spark vs hudi-client-core
> >
> > agree
> >
> > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> >
> > > >> We should only stick to Flink Streaming. Furthermore if there is a
> > > requirement for batch then users
> > > >> should use Spark or then we will anyway have a beam integration
> coming
> > > up.
> > >
> > > Currently Spark Streaming micro batching fits well with Hudi, since it
> > > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> > batch
> > > = 1 hudi commit
> > > With the per-record model in Flink, I am not sure how useful it will be
> > to
> > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will
> > be
> > > inefficient..
> > >
> > > On first focussing on decoupling of Spark and Hudi alone, yes a full
> > > summary of how Spark is being used in a wiki page is a good start IMO.
> We
> > > can then hash out what can be generalized and what cannot be and needs
> to
> > > be left in hudi-client-spark vs hudi-client-core
> > >
> > >
> > >
> > > On Tue, Aug 13, 2019 at 3:57 AM vino yang 
> wrote:
> > >
> > > > Hi Nick and Taher,
> > > >
> > > > I just want to answer Nishith's question. Reference his old
> description
> > > > here:
> > > >
> > > > > You can do a parallel investigation while we are deciding on the
> > module
> > > > structure.  You could be looking at all the patterns in Hudi's Spark
> > APIs
> > > > usage (RDD/DataSource/SparkContext) and see if such support can be
> > > achieved
> > > > in theory with Flink. If not, what is the workaround. Documenting
> such
> > > > patterns would be valuable when multiple engineers are working on it.
> > For
> > > > e:g, Hudi relies on (a) custom partitioning logic for upserts,
> > >  (b)
> > > > caching RDDs to avoid reruns of costly stages (c) A Spark upsert
> > task
> > > > knowing its spark partition/task/attempt ids
> > > >
> > > > And just like the title of this thread, we are going to try to
> decouple
> > > > Hudi and Spark. That means we can run the whole Hudi without
> depending
> > > > Spark. So we need to analyze all the usage of Spark in Hudi.
> > > >
> > > > Here we are not discussing the integration of Hudi and Flink in the
> > > > application layer. Instead, I want Hudi to be decoupled from Spark
> and
> > > > allow other engines (such as Flink) to replace Spark.
> > > >
> > > > It can be divided into long-term goals and short-term goals. As
> Nishith
> > > > stated in a recent email.
> > > >
> > > > I mentioned the Flink Batch API here because Hudi can connect with
> many
> > > > different Source/Sinks. Some file-based reads are not appropriate for
> > > Flink
> > > > Streaming.
> > > >
> > > > Therefore, this is a comprehensive survey of the use of Spark in
> Hudi.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > >
> > > > taher koitawala  于2019年8月13日周二 下午5:43写道:
> > > >
> > > > > Hi Vino,
> > > > >   According to what I've seen Hudi has a lot of spark component
> > > > flowing
> > > > > throwing it. Like Taskcontexts, JavaSparkContexts etc. The main
> > > classes I
> > > > > guess we should focus upon is HoodieTable and Hoodie write clients.
> > > > >
> > > > > Also Vino, I don't think we should be providing Flink dataset
> > > > > implementation. We should only stick to Flink Streaming.
> > > > >Furthermore if there is a requirement for batch then
> > > users
> > > > > should use Spark or then we will anyway have a beam integration
> > coming
> > > > up.
> > > > >
> > > > > As of cache, How about we write our stateful Flink function and use
> > > > > RocksDbStateBackend with some state TTL.
> > > > >
> > > > > On Tue, Aug 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-14 Thread Vinoth Chandar
Look forward to the analysis. A key class to read would be
HoodieBloomIndex, which uses a lot of spark caching and shuffles.
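
(For anyone ramping up on that class: below is a minimal, hypothetical Java/Spark
sketch of just the general caching pattern involved, i.e. persist an RDD once and
then traverse it more than once so it is not recomputed per action. It is NOT
Hudi's actual HoodieBloomIndex logic; the record type and the two passes are made
up for illustration.)

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CachingPatternSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "caching-pattern-sketch");

    // Hypothetical input; in the real code this would be the incoming records to index.
    JavaRDD<String> inputRecords = jsc.parallelize(Arrays.asList("key1", "key2", "key3"));

    // Persist once so the actions below do not each recompute the RDD from scratch.
    inputRecords.persist(StorageLevel.MEMORY_AND_DISK());

    // Pass 1: e.g. gather statistics about the input.
    long total = inputRecords.count();

    // Pass 2: a second traversal (in the real code, a shuffle/join against file metadata).
    long distinctKeys = inputRecords.distinct().count();

    System.out.println(total + " records, " + distinctKeys + " distinct keys");
    inputRecords.unpersist();
    jsc.stop();
  }
}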

On Tue, Aug 13, 2019 at 7:52 PM vino yang  wrote:

> >> Currently Spark Streaming micro batching fits well with Hudi, since it
> amortizes the cost of indexing, workload profiling etc. 1 spark micro batch
> = 1 hudi commit
> With the per-record model in Flink, I am not sure how useful it will be to
> support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will be
> inefficient..
>
> Yes, if 1 input record = 1 hudi commit, it would be inefficient. About
> Flink streaming, we can also implement the "batch" and "micro-batch" model
> when process data. For example:
>
>- aggregation: use flexibility window mechanism;
>- non-aggregation: use Flink stateful state API cache a batch data
>
>
> >> On first focussing on decoupling of Spark and Hudi alone, yes a full
> summary of how Spark is being used in a wiki page is a good start IMO. We
> can then hash out what can be generalized and what cannot be and needs to
> be left in hudi-client-spark vs hudi-client-core
>
> agree
>
> Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
>
> > >> We should only stick to Flink Streaming. Furthermore if there is a
> > requirement for batch then users
> > >> should use Spark or then we will anyway have a beam integration coming
> > up.
> >
> > Currently Spark Streaming micro batching fits well with Hudi, since it
> > amortizes the cost of indexing, workload profiling etc. 1 spark micro
> batch
> > = 1 hudi commit
> > With the per-record model in Flink, I am not sure how useful it will be
> to
> > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will
> be
> > inefficient..
> >
> > On first focussing on decoupling of Spark and Hudi alone, yes a full
> > summary of how Spark is being used in a wiki page is a good start IMO. We
> > can then hash out what can be generalized and what cannot be and needs to
> > be left in hudi-client-spark vs hudi-client-core
> >
> >
> >
> > On Tue, Aug 13, 2019 at 3:57 AM vino yang  wrote:
> >
> > > Hi Nick and Taher,
> > >
> > > I just want to answer Nishith's question. Reference his old description
> > > here:
> > >
> > > > You can do a parallel investigation while we are deciding on the
> module
> > > structure.  You could be looking at all the patterns in Hudi's Spark
> APIs
> > > usage (RDD/DataSource/SparkContext) and see if such support can be
> > achieved
> > > in theory with Flink. If not, what is the workaround. Documenting such
> > > patterns would be valuable when multiple engineers are working on it.
> For
> > > e:g, Hudi relies on (a) custom partitioning logic for upserts,
> >  (b)
> > > caching RDDs to avoid reruns of costly stages (c) A Spark upsert
> task
> > > knowing its spark partition/task/attempt ids
> > >
> > > And just like the title of this thread, we are going to try to decouple
> > > Hudi and Spark. That means we can run the whole Hudi without depending
> > > Spark. So we need to analyze all the usage of Spark in Hudi.
> > >
> > > Here we are not discussing the integration of Hudi and Flink in the
> > > application layer. Instead, I want Hudi to be decoupled from Spark and
> > > allow other engines (such as Flink) to replace Spark.
> > >
> > > It can be divided into long-term goals and short-term goals. As Nishith
> > > stated in a recent email.
> > >
> > > I mentioned the Flink Batch API here because Hudi can connect with many
> > > different Source/Sinks. Some file-based reads are not appropriate for
> > Flink
> > > Streaming.
> > >
> > > Therefore, this is a comprehensive survey of the use of Spark in Hudi.
> > >
> > > Best,
> > > Vino
> > >
> > >
> > > taher koitawala  于2019年8月13日周二 下午5:43写道:
> > >
> > > > Hi Vino,
> > > >   According to what I've seen Hudi has a lot of spark component
> > > flowing
> > > > throwing it. Like Taskcontexts, JavaSparkContexts etc. The main
> > classes I
> > > > guess we should focus upon is HoodieTable and Hoodie write clients.
> > > >
> > > > Also Vino, I don't think we should be providing Flink dataset
> > > > implementation. We should only stick to Flink Streaming.
> > > >Furthermore if there is a requirement for batch then
> > users
> > > > should use Spark or then we will anyway have a beam integration
> coming
> > > up.
> > > >
> > > > As of cache, How about we write our stateful Flink function and use
> > > > RocksDbStateBackend with some state TTL.
> > > >
> > > > On Tue, Aug 13, 2019, 2:28 PM vino yang 
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > After doing some research, let me share my information:
> > > > >
> > > > >
> > > > >- Limitation of computing engine capabilities: Hudi uses Spark's
> > > > >RDD#persist, and Flink currently has no API to cache datasets.
> > Maybe
> > > > we
> > > > > can
> > > > >only choose to use external storage or do not use cache? For the
> > use
> > > > of
> > > > >other APIs, the two currently 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
>> Currently Spark Streaming micro batching fits well with Hudi, since it
amortizes the cost of indexing, workload profiling etc. 1 spark micro batch
= 1 hudi commit
With the per-record model in Flink, I am not sure how useful it will be to
support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will be
inefficient..

Yes, if 1 input record = 1 hudi commit, it would be inefficient. For
Flink streaming, we can also implement a "batch" or "micro-batch" model
when processing data. For example:

   - aggregation: use the flexible window mechanism (a minimal sketch follows
   below);
   - non-aggregation: use the Flink stateful state API to cache a batch of data
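
(To make the aggregation bullet concrete, here is a minimal, hypothetical Java
sketch: keyed tumbling processing-time windows collect one "batch" per key per
window, so downstream work is amortized over the window rather than done per
record. The record type, key and 1-minute window size are made up; this is not
Hudi code.)

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowMicroBatchSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Hypothetical keyed input, e.g. (partitionPath, 1) per incoming record.
    DataStream<Tuple2<String, Integer>> input = env.fromElements(
        Tuple2.of("2019/08/13", 1), Tuple2.of("2019/08/13", 1), Tuple2.of("2019/08/14", 1));

    // One aggregated result per key per 1-minute window, i.e. a window-sized "micro batch".
    input.keyBy(t -> t.f0)
        .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
        .sum(1)
        .print();

    env.execute("window-micro-batch-sketch");
  }
}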


>> On first focussing on decoupling of Spark and Hudi alone, yes a full
summary of how Spark is being used in a wiki page is a good start IMO. We
can then hash out what can be generalized and what cannot be and needs to
be left in hudi-client-spark vs hudi-client-core

agree

Vinoth Chandar  wrote on Wed, Aug 14, 2019 at 8:35 AM:

> >> We should only stick to Flink Streaming. Furthermore if there is a
> requirement for batch then users
> >> should use Spark or then we will anyway have a beam integration coming
> up.
>
> Currently Spark Streaming micro batching fits well with Hudi, since it
> amortizes the cost of indexing, workload profiling etc. 1 spark micro batch
> = 1 hudi commit
> With the per-record model in Flink, I am not sure how useful it will be to
> support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it will be
> inefficient..
>
> On first focussing on decoupling of Spark and Hudi alone, yes a full
> summary of how Spark is being used in a wiki page is a good start IMO. We
> can then hash out what can be generalized and what cannot be and needs to
> be left in hudi-client-spark vs hudi-client-core
>
>
>
> On Tue, Aug 13, 2019 at 3:57 AM vino yang  wrote:
>
> > Hi Nick and Taher,
> >
> > I just want to answer Nishith's question. Reference his old description
> > here:
> >
> > > You can do a parallel investigation while we are deciding on the module
> > structure.  You could be looking at all the patterns in Hudi's Spark APIs
> > usage (RDD/DataSource/SparkContext) and see if such support can be
> achieved
> > in theory with Flink. If not, what is the workaround. Documenting such
> > patterns would be valuable when multiple engineers are working on it. For
> > e:g, Hudi relies on (a) custom partitioning logic for upserts,
>  (b)
> > caching RDDs to avoid reruns of costly stages (c) A Spark upsert task
> > knowing its spark partition/task/attempt ids
> >
> > And just like the title of this thread, we are going to try to decouple
> > Hudi and Spark. That means we can run the whole Hudi without depending
> > Spark. So we need to analyze all the usage of Spark in Hudi.
> >
> > Here we are not discussing the integration of Hudi and Flink in the
> > application layer. Instead, I want Hudi to be decoupled from Spark and
> > allow other engines (such as Flink) to replace Spark.
> >
> > It can be divided into long-term goals and short-term goals. As Nishith
> > stated in a recent email.
> >
> > I mentioned the Flink Batch API here because Hudi can connect with many
> > different Source/Sinks. Some file-based reads are not appropriate for
> Flink
> > Streaming.
> >
> > Therefore, this is a comprehensive survey of the use of Spark in Hudi.
> >
> > Best,
> > Vino
> >
> >
> > taher koitawala  于2019年8月13日周二 下午5:43写道:
> >
> > > Hi Vino,
> > >   According to what I've seen Hudi has a lot of spark component
> > flowing
> > > throwing it. Like Taskcontexts, JavaSparkContexts etc. The main
> classes I
> > > guess we should focus upon is HoodieTable and Hoodie write clients.
> > >
> > > Also Vino, I don't think we should be providing Flink dataset
> > > implementation. We should only stick to Flink Streaming.
> > >Furthermore if there is a requirement for batch then
> users
> > > should use Spark or then we will anyway have a beam integration coming
> > up.
> > >
> > > As of cache, How about we write our stateful Flink function and use
> > > RocksDbStateBackend with some state TTL.
> > >
> > > On Tue, Aug 13, 2019, 2:28 PM vino yang  wrote:
> > >
> > > > Hi all,
> > > >
> > > > After doing some research, let me share my information:
> > > >
> > > >
> > > >- Limitation of computing engine capabilities: Hudi uses Spark's
> > > >RDD#persist, and Flink currently has no API to cache datasets.
> Maybe
> > > we
> > > > can
> > > >only choose to use external storage or do not use cache? For the
> use
> > > of
> > > >other APIs, the two currently offer almost equivalent
> capabilities.
> > > >- The abstraction of the computing engine is different:
> Considering
> > > the
> > > >different usage scenarios of the computing engine in Hudi, Flink
> has
> > > not
> > > >yet implemented stream batch unification, so we may use both
> Flink's
> > > >DataSet API (batch processing) and DataStream API (stream
> > processing).
> > > >
> > > > Best,
> > > > 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
Hi Nick and Taher,

I just want to answer Nishith's question. Reference his old description
here:

> You can do a parallel investigation while we are deciding on the module
structure.  You could be looking at all the patterns in Hudi's Spark APIs
usage (RDD/DataSource/SparkContext) and see if such support can be achieved
in theory with Flink. If not, what is the workaround. Documenting such
patterns would be valuable when multiple engineers are working on it. For
e:g, Hudi relies on (a) custom partitioning logic for upserts, (b)
caching RDDs to avoid reruns of costly stages (c) A Spark upsert task
knowing its spark partition/task/attempt ids

And just like the title of this thread, we are going to try to decouple
Hudi and Spark. That means we can run the whole of Hudi without depending on
Spark. So we need to analyze all the usage of Spark in Hudi.

Here we are not discussing the integration of Hudi and Flink in the
application layer. Instead, I want Hudi to be decoupled from Spark and
allow other engines (such as Flink) to replace Spark.

It can be divided into long-term goals and short-term goals, as Nishith
stated in a recent email.

I mentioned the Flink Batch API here because Hudi can connect with many
different Source/Sinks. Some file-based reads are not appropriate for Flink
Streaming.

Therefore, this is a comprehensive survey of the use of Spark in Hudi.

Best,
Vino


taher koitawala  wrote on Tue, Aug 13, 2019 at 5:43 PM:

> Hi Vino,
>   According to what I've seen Hudi has a lot of spark component flowing
> throwing it. Like Taskcontexts, JavaSparkContexts etc. The main classes I
> guess we should focus upon is HoodieTable and Hoodie write clients.
>
> Also Vino, I don't think we should be providing Flink dataset
> implementation. We should only stick to Flink Streaming.
>Furthermore if there is a requirement for batch then users
> should use Spark or then we will anyway have a beam integration coming up.
>
> As of cache, How about we write our stateful Flink function and use
> RocksDbStateBackend with some state TTL.
>
> On Tue, Aug 13, 2019, 2:28 PM vino yang  wrote:
>
> > Hi all,
> >
> > After doing some research, let me share my information:
> >
> >
> >- Limitation of computing engine capabilities: Hudi uses Spark's
> >RDD#persist, and Flink currently has no API to cache datasets. Maybe
> we
> > can
> >only choose to use external storage or do not use cache? For the use
> of
> >other APIs, the two currently offer almost equivalent capabilities.
> >- The abstraction of the computing engine is different: Considering
> the
> >different usage scenarios of the computing engine in Hudi, Flink has
> not
> >yet implemented stream batch unification, so we may use both Flink's
> >DataSet API (batch processing) and DataStream API (stream processing).
> >
> > Best,
> > Vino
> >
> > nishith agarwal  于2019年8月8日周四 上午12:57写道:
> >
> > > Nick,
> > >
> > > You bring up a good point about the non-trivial programming model
> > > differences between these different technologies. From a theoretical
> > > perspective, I'd say considering a higher level abstraction makes
> sense.
> > I
> > > think we have to decouple some objectives and concerns here.
> > >
> > > a) The immediate desire is to have Hudi be able to run on a Flink (or
> > > non-spark) engine. This naturally begs the question of decoupling Hudi
> > > concepts from direct Spark dependencies.
> > >
> > > b) If we do want to initiate the above effort, would it make sense to
> > just
> > > have a higher level abstraction, building on other technologies like
> beam
> > > (euphoria etc) and provide single, clean API's that may be more
> > > maintainable from a code perspective. But at the same time this will
> > > introduce challenges on how to maintain efficiency and optimized
> runtime
> > > dags for Hudi (since the code would move away from point integrations
> and
> > > whenever this happens, tuning natively for specific engines becomes
> more
> > > and more difficult).
> > >
> > > My general opinion is that, as the community grows over time with more
> > > folks having an in-depth understanding of Hudi, going from
> current_state
> > ->
> > > (a) -> (b) might be the most reliable and adoptable path for this
> > project.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng 
> > > wrote:
> > >
> > > > There are some not trivial difference between programming model and
> > > > runtime semantics between Beam, Spark and Flink.
> > > >
> > > >
> > > >
> > >
> >
> https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > >
> > > > Nitish, Vino - thoughts?
> > > >
> > > > Does it feel to consider a higher level abstraction / DSL instead of
> > > > maintaining different code with same functionality but different
> > > > programming models ?
> > > >
> > > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > > >
> > > > Nick
> > > >
> > > >
> > > 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread taher koitawala
Hi Vino,
  According to what I've seen, Hudi has a lot of Spark components flowing
through it, like TaskContexts, JavaSparkContexts etc. The main classes I
guess we should focus on are HoodieTable and the Hoodie write clients.

Also Vino, I don't think we should be providing a Flink DataSet
implementation. We should only stick to Flink Streaming. Furthermore, if
there is a requirement for batch, then users should use Spark, or else we
will anyway have a Beam integration coming up.

As for the cache, how about we write our own stateful Flink function and use
the RocksDBStateBackend with some state TTL?
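
(A rough, hypothetical Java sketch of that idea: a KeyedProcessFunction buffers
records per key in ListState with a StateTtlConfig, on top of the
RocksDBStateBackend from the flink-statebackend-rocksdb module. The String
payload, keying, 1-minute flush timer and checkpoint path are made up; this is
not Hudi code.)

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class StateBufferSketch {

  /** Buffers records per key in RocksDB-backed state and flushes them as one "batch". */
  public static class BufferingFunction extends KeyedProcessFunction<String, String, List<String>> {
    private transient ListState<String> buffer;

    @Override
    public void open(Configuration parameters) {
      // Expire idle state after 10 minutes so abandoned keys do not grow the state forever.
      StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.minutes(10))
          .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
          .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
          .build();
      ListStateDescriptor<String> desc = new ListStateDescriptor<>("record-buffer", String.class);
      desc.enableTimeToLive(ttl);
      buffer = getRuntimeContext().getListState(desc);
    }

    @Override
    public void processElement(String value, Context ctx, Collector<List<String>> out) throws Exception {
      buffer.add(value);
      // Flush roughly once a minute (hypothetical interval) to emulate a micro batch.
      ctx.timerService().registerProcessingTimeTimer(
          ctx.timerService().currentProcessingTime() + 60_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<String>> out) throws Exception {
      Iterable<String> buffered = buffer.get();
      if (buffered == null) {
        return;
      }
      List<String> batch = new ArrayList<>();
      for (String record : buffered) {
        batch.add(record);
      }
      if (!batch.isEmpty()) {
        out.collect(batch); // hand the buffered batch downstream, e.g. to a Hudi write step
        buffer.clear();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Keep the buffered state in RocksDB rather than on the JVM heap.
    env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"));
    env.enableCheckpointing(60_000L);

    env.fromElements("record-1", "record-2", "record-3")
        .keyBy(r -> r) // hypothetical key; real code would key by e.g. partition/file group
        .process(new BufferingFunction())
        .print();

    env.execute("state-buffer-sketch");
  }
}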

On Tue, Aug 13, 2019, 2:28 PM vino yang  wrote:

> Hi all,
>
> After doing some research, let me share my information:
>
>
>- Limitation of computing engine capabilities: Hudi uses Spark's
>RDD#persist, and Flink currently has no API to cache datasets. Maybe we
> can
>only choose to use external storage or do not use cache? For the use of
>other APIs, the two currently offer almost equivalent capabilities.
>- The abstraction of the computing engine is different: Considering the
>different usage scenarios of the computing engine in Hudi, Flink has not
>yet implemented stream batch unification, so we may use both Flink's
>DataSet API (batch processing) and DataStream API (stream processing).
>
> Best,
> Vino
>
> nishith agarwal  于2019年8月8日周四 上午12:57写道:
>
> > Nick,
> >
> > You bring up a good point about the non-trivial programming model
> > differences between these different technologies. From a theoretical
> > perspective, I'd say considering a higher level abstraction makes sense.
> I
> > think we have to decouple some objectives and concerns here.
> >
> > a) The immediate desire is to have Hudi be able to run on a Flink (or
> > non-spark) engine. This naturally begs the question of decoupling Hudi
> > concepts from direct Spark dependencies.
> >
> > b) If we do want to initiate the above effort, would it make sense to
> just
> > have a higher level abstraction, building on other technologies like beam
> > (euphoria etc) and provide single, clean API's that may be more
> > maintainable from a code perspective. But at the same time this will
> > introduce challenges on how to maintain efficiency and optimized runtime
> > dags for Hudi (since the code would move away from point integrations and
> > whenever this happens, tuning natively for specific engines becomes more
> > and more difficult).
> >
> > My general opinion is that, as the community grows over time with more
> > folks having an in-depth understanding of Hudi, going from current_state
> ->
> > (a) -> (b) might be the most reliable and adoptable path for this
> project.
> >
> > Thanks,
> > Nishith
> >
> > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng 
> > wrote:
> >
> > > There are some not trivial difference between programming model and
> > > runtime semantics between Beam, Spark and Flink.
> > >
> > >
> > >
> >
> https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > >
> > > Nitish, Vino - thoughts?
> > >
> > > Does it feel to consider a higher level abstraction / DSL instead of
> > > maintaining different code with same functionality but different
> > > programming models ?
> > >
> > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > >
> > > Nick
> > >
> > >
> > >
> > >
> > > On August 6, 2019 at 4:04 PM nishith agarwal 
> > wrote:
> > >
> > >
> > > +1 for Approach 1 Point integration with each framework.
> > >
> > > Pros for point integration
> > >
> > >- Hudi community is already familiar with spark and spark based
> > >
> > >
> > > actions/shuffles etc. Since both modules can be decoupled, this enables
> > us
> > > to have a steady release for Hudi for 1 execution engine (spark) while
> we
> > > hone our skills and iterate on making flink dag optimized, performant
> > with
> > > the right configuration.
> > >
> > >- This might be a stepping stone towards rewriting the entire code
> > base
> > >
> > >
> > > being agnostic of spark/flink. This approach will help us fix tests,
> > > intricacies and help make the code base ready for a larger rework.
> > >
> > >- Seems like the easiest way to add flink support
> > >
> > >
> > >
> > > Cons
> > >
> > >- More code paths to maintain and reason since the spark and flink
> > >
> > >
> > > integrations will naturally diverge over time.
> > >
> > > Theoretically, I do like the idea of being able to run the hudi dag on
> > beam
> > > more than point integrations, where there is one API/logic to reason
> > about.
> > > But practically, that may not be the right direction.
> > >
> > > Pros
> > >
> > >- Lesser cognitive burden in maintaining, evolving and releasing the
> > >
> > >
> > > project with one API to reason with.
> > >
> > >- Theoretically, going forward assuming beam is adopted as a
> standard
> > >
> > >
> > > programming paradigm for stream/batch, this would enable consumers
> > leverage
> > > the 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
Hi all,

After doing some research, let me share my information:


    - Limitation of computing engine capabilities: Hudi uses Spark's
    RDD#persist, and Flink currently has no API to cache datasets. Maybe we can
    only choose to use external storage, or not use a cache at all? For the use of
    other APIs, the two engines currently offer almost equivalent capabilities.
    - The abstraction of the computing engine is different: considering the
    different usage scenarios of the computing engine in Hudi, and given that Flink
    has not yet implemented stream-batch unification, we may need to use both
    Flink's DataSet API (batch processing) and DataStream API (stream processing).

Best,
Vino

nishith agarwal  wrote on Thu, Aug 8, 2019 at 12:57 AM:

> Nick,
>
> You bring up a good point about the non-trivial programming model
> differences between these different technologies. From a theoretical
> perspective, I'd say considering a higher level abstraction makes sense. I
> think we have to decouple some objectives and concerns here.
>
> a) The immediate desire is to have Hudi be able to run on a Flink (or
> non-spark) engine. This naturally begs the question of decoupling Hudi
> concepts from direct Spark dependencies.
>
> b) If we do want to initiate the above effort, would it make sense to just
> have a higher level abstraction, building on other technologies like beam
> (euphoria etc) and provide single, clean API's that may be more
> maintainable from a code perspective. But at the same time this will
> introduce challenges on how to maintain efficiency and optimized runtime
> dags for Hudi (since the code would move away from point integrations and
> whenever this happens, tuning natively for specific engines becomes more
> and more difficult).
>
> My general opinion is that, as the community grows over time with more
> folks having an in-depth understanding of Hudi, going from current_state ->
> (a) -> (b) might be the most reliable and adoptable path for this project.
>
> Thanks,
> Nishith
>
> On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng 
> wrote:
>
> > There are some not trivial difference between programming model and
> > runtime semantics between Beam, Spark and Flink.
> >
> >
> >
> https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> >
> > Nitish, Vino - thoughts?
> >
> > Does it feel to consider a higher level abstraction / DSL instead of
> > maintaining different code with same functionality but different
> > programming models ?
> >
> > https://beam.apache.org/documentation/sdks/java/euphoria/
> >
> > Nick
> >
> >
> >
> >
> > On August 6, 2019 at 4:04 PM nishith agarwal 
> wrote:
> >
> >
> > +1 for Approach 1 Point integration with each framework.
> >
> > Pros for point integration
> >
> >- Hudi community is already familiar with spark and spark based
> >
> >
> > actions/shuffles etc. Since both modules can be decoupled, this enables
> us
> > to have a steady release for Hudi for 1 execution engine (spark) while we
> > hone our skills and iterate on making flink dag optimized, performant
> with
> > the right configuration.
> >
> >- This might be a stepping stone towards rewriting the entire code
> base
> >
> >
> > being agnostic of spark/flink. This approach will help us fix tests,
> > intricacies and help make the code base ready for a larger rework.
> >
> >- Seems like the easiest way to add flink support
> >
> >
> >
> > Cons
> >
> >- More code paths to maintain and reason since the spark and flink
> >
> >
> > integrations will naturally diverge over time.
> >
> > Theoretically, I do like the idea of being able to run the hudi dag on
> beam
> > more than point integrations, where there is one API/logic to reason
> about.
> > But practically, that may not be the right direction.
> >
> > Pros
> >
> >- Lesser cognitive burden in maintaining, evolving and releasing the
> >
> >
> > project with one API to reason with.
> >
> >- Theoretically, going forward assuming beam is adopted as a standard
> >
> >
> > programming paradigm for stream/batch, this would enable consumers
> leverage
> > the power of hudi more easily.
> >
> > Cons
> >
> >- Massive rewrite of the code base. Additionally, since we would have
> >moved
> >
> >
> > away from directly using spark APIs, there is a bigger risk of
> regression.
> > We would have to be very thorough with all the intricacies and ensure the
> > same stability of new releases.
> >
> >- Managing future features (which may be very spark driven) will
> either
> >
> >
> > clash or pause or will need to be reworked.
> >
> >- Tuning jobs for Spark/Flink type execution frameworks individually
> >might
> >
> >
> > be difficult and will get difficult over time as the project evolves,
> where
> > some beam integrations with spark/flink may not work as expected.
> >
> >- Also, as pointed above, need to probably support the hoodie-spark
> >module
> >
> >
> > as a first-class.
> >
> > Thank,
> > Nishith
> >
> >
> > On Tue, Aug 6, 2019 at 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-07 Thread taher koitawala
Thanks a ton Vinoth.

On Wed, Aug 7, 2019 at 4:34 PM Vinoth Chandar  wrote:

> >>Are there some tasks I can take up to ramp up the code?
> Certainly. There are some open tasks that touch the hoodie-client and
> hoodie-utilities module.
> https://issues.apache.org/jira/browse/HUDI-37
> https://issues.apache.org/jira/browse/HUDI-194
> https://issues.apache.org/jira/browse/HUDI-145
> https://issues.apache.org/jira/browse/HUDI-130
> https://issues.apache.org/jira/browse/HUDI-62
>
> IMO, getting hands dirty with a few of these and may be 1-2 more involved
> ones, would set enough context to drive the hudi-on-flink project.
>
>
> On Tue, Aug 6, 2019 at 1:04 PM nishith agarwal 
> wrote:
>
> > +1 for Approach 1 Point integration with each framework.
> >
> > Pros for point integration
> > - Hudi community is already familiar with spark and spark based
> > actions/shuffles etc. Since both modules can be decoupled, this enables
> us
> > to have a steady release for Hudi for 1 execution engine (spark) while we
> > hone our skills and iterate on making flink dag optimized, performant
> with
> > the right configuration.
> > - This might be a stepping stone towards rewriting the entire code base
> > being agnostic of spark/flink. This approach will help us fix tests,
> > intricacies and help make the code base ready for a larger rework.
> > - Seems like the easiest way to add flink support
> >
> > Cons
> > - More code paths to maintain and reason since the spark and flink
> > integrations will naturally diverge over time.
> >
> > Theoretically, I do like the idea of being able to run the hudi dag on
> beam
> > more than point integrations, where there is one API/logic to reason
> about.
> > But practically, that may not be the right direction.
> >
> > Pros
> > - Lesser cognitive burden in maintaining, evolving and releasing the
> > project with one API to reason with.
> > - Theoretically, going forward assuming beam is adopted as a standard
> > programming paradigm for stream/batch, this would enable consumers
> leverage
> > the power of hudi more easily.
> >
> > Cons
> > - Massive rewrite of the code base. Additionally, since we would have
> moved
> > away from directly using spark APIs, there is a bigger risk of
> regression.
> > We would have to be very thorough with all the intricacies and ensure the
> > same stability of new releases.
> > - Managing future features (which may be very spark driven) will either
> > clash or pause or will need to be reworked.
> > - Tuning jobs for Spark/Flink type execution frameworks individually
> might
> > be difficult and will get difficult over time as the project evolves,
> where
> > some beam integrations with spark/flink may not work as expected.
> > - Also, as pointed above, need to probably support the hoodie-spark
> module
> > as a first-class.
> >
> > Thank,
> > Nishith
> >
> >
> > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala 
> wrote:
> >
> > > Hi Vinoth,
> > > Are there some tasks I can take up to ramp up the code? Want to
> > get
> > > more used to the code and understand the existing implementation
> better.
> > >
> > > Thanks,
> > > Taher Koitawala
> > >
> > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar 
> wrote:
> > >
> > > > Let's see if others have any thoughts as well. We can plan to fix the
> > > > approach by EOW.
> > > >
> > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang 
> > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > Also, +1 for Approach 1 like Taher.
> > > > >
> > > > > > If we can do a comprehensive analysis of this model and come up
> > with.
> > > > > means
> > > > > > to refactor this cleanly, this would be promising.
> > > > >
> > > > > Yes, when we get the conclusion, we could start this work.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > >
> > > > > taher koitawala  于2019年8月6日周二 上午12:28写道:
> > > > >
> > > > > > +1 for Approch 1 Point integration with each framework
> > > > > >
> > > > > > Approach 2 has a problem as you said "Developers need to think
> > about
> > > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> > > this
> > > > > may
> > > > > > not be the panacea that it seems to be"
> > > > > >
> > > > > > We have seen various pipelines in the beam dag being expressed
> > > > > differently
> > > > > > then we had them in our original usecase. And also switching
> > between
> > > > > spark
> > > > > > and Flink runners in beam have various impact on the pipelines
> like
> > > > some
> > > > > > features available in Flink are not available on the spark runner
> > > etc.
> > > > > > Refer to this compatible matrix ->
> > > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > > >
> > > > > > Hence my vote on Approch 1 let's decouple and build the abstract
> > for
> > > > each
> > > > > > framework. That is a much better option. We will also have more
> > > control
> > > > > > over each framework's implement.
> > > > > >
> > > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-07 Thread Vinoth Chandar
>>Are there some tasks I can take up to ramp up the code?
Certainly. There are some open tasks that touch the hoodie-client and
hoodie-utilities modules.
https://issues.apache.org/jira/browse/HUDI-37
https://issues.apache.org/jira/browse/HUDI-194
https://issues.apache.org/jira/browse/HUDI-145
https://issues.apache.org/jira/browse/HUDI-130
https://issues.apache.org/jira/browse/HUDI-62

IMO, getting hands dirty with a few of these, and maybe 1-2 more involved
ones, would set enough context to drive the hudi-on-flink project.


On Tue, Aug 6, 2019 at 1:04 PM nishith agarwal  wrote:

> +1 for Approach 1 Point integration with each framework.
>
> Pros for point integration
> - Hudi community is already familiar with spark and spark based
> actions/shuffles etc. Since both modules can be decoupled, this enables us
> to have a steady release for Hudi for 1 execution engine (spark) while we
> hone our skills and iterate on making flink dag optimized, performant with
> the right configuration.
> - This might be a stepping stone towards rewriting the entire code base
> being agnostic of spark/flink. This approach will help us fix tests,
> intricacies and help make the code base ready for a larger rework.
> - Seems like the easiest way to add flink support
>
> Cons
> - More code paths to maintain and reason since the spark and flink
> integrations will naturally diverge over time.
>
> Theoretically, I do like the idea of being able to run the hudi dag on beam
> more than point integrations, where there is one API/logic to reason about.
> But practically, that may not be the right direction.
>
> Pros
> - Lesser cognitive burden in maintaining, evolving and releasing the
> project with one API to reason with.
> - Theoretically, going forward assuming beam is adopted as a standard
> programming paradigm for stream/batch, this would enable consumers leverage
> the power of hudi more easily.
>
> Cons
> - Massive rewrite of the code base. Additionally, since we would have moved
> away from directly using spark APIs, there is a bigger risk of regression.
> We would have to be very thorough with all the intricacies and ensure the
> same stability of new releases.
> - Managing future features (which may be very spark driven) will either
> clash or pause or will need to be reworked.
> - Tuning jobs for Spark/Flink type execution frameworks individually might
> be difficult and will get difficult over time as the project evolves, where
> some beam integrations with spark/flink may not work as expected.
> - Also, as pointed above, need to probably support the hoodie-spark module
> as a first-class.
>
> Thank,
> Nishith
>
>
> On Tue, Aug 6, 2019 at 9:48 AM taher koitawala  wrote:
>
> > Hi Vinoth,
> > Are there some tasks I can take up to ramp up the code? Want to
> get
> > more used to the code and understand the existing implementation better.
> >
> > Thanks,
> > Taher Koitawala
> >
> > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar  wrote:
> >
> > > Let's see if others have any thoughts as well. We can plan to fix the
> > > approach by EOW.
> > >
> > > On Mon, Aug 5, 2019 at 7:06 PM vino yang 
> wrote:
> > >
> > > > Hi guys,
> > > >
> > > > Also, +1 for Approach 1 like Taher.
> > > >
> > > > > If we can do a comprehensive analysis of this model and come up
> with.
> > > > means
> > > > > to refactor this cleanly, this would be promising.
> > > >
> > > > Yes, when we get the conclusion, we could start this work.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > >
> > > > taher koitawala  于2019年8月6日周二 上午12:28写道:
> > > >
> > > > > +1 for Approch 1 Point integration with each framework
> > > > >
> > > > > Approach 2 has a problem as you said "Developers need to think
> about
> > > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> > this
> > > > may
> > > > > not be the panacea that it seems to be"
> > > > >
> > > > > We have seen various pipelines in the beam dag being expressed
> > > > differently
> > > > > then we had them in our original usecase. And also switching
> between
> > > > spark
> > > > > and Flink runners in beam have various impact on the pipelines like
> > > some
> > > > > features available in Flink are not available on the spark runner
> > etc.
> > > > > Refer to this compatible matrix ->
> > > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > > >
> > > > > Hence my vote on Approch 1 let's decouple and build the abstract
> for
> > > each
> > > > > framework. That is a much better option. We will also have more
> > control
> > > > > over each framework's implement.
> > > > >
> > > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar 
> > wrote:
> > > > >
> > > > > > Would like to highlight that there are two distinct approaches
> here
> > > > with
> > > > > > different tradeoffs. Think of this as my braindump, as I have
> been
> > > > > thinking
> > > > > > about this quite a bit in the past.
> > > > > >
> > > > > >
> > > > > > *Approach 1 : Point integration with each 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread nishith agarwal
+1 for Approach 1 Point integration with each framework.

Pros for point integration
- Hudi community is already familiar with spark and spark-based
actions/shuffles etc. Since both modules can be decoupled, this enables us
to have a steady release cadence for Hudi on one execution engine (spark) while
we hone our skills and iterate on making the flink dag optimized and performant
with the right configuration.
- This might be a stepping stone towards rewriting the entire code base
being agnostic of spark/flink. This approach will help us fix tests,
intricacies and help make the code base ready for a larger rework.
- Seems like the easiest way to add flink support

Cons
- More code paths to maintain and reason about, since the spark and flink
integrations will naturally diverge over time.

Theoretically, I do like the idea of being able to run the hudi dag on beam
more than point integrations, where there is one API/logic to reason about.
But practically, that may not be the right direction.

Pros
- Lesser cognitive burden in maintaining, evolving and releasing the
project with one API to reason with.
- Theoretically, going forward, assuming beam is adopted as a standard
programming paradigm for stream/batch, this would enable consumers to leverage
the power of hudi more easily.

Cons
- Massive rewrite of the code base. Additionally, since we would have moved
away from directly using spark APIs, there is a bigger risk of regression.
We would have to be very thorough with all the intricacies and ensure the
same stability of new releases.
- Managing future features (which may be very spark driven) will either
clash or pause or will need to be reworked.
- Tuning jobs for Spark/Flink type execution frameworks individually might
be difficult and will get difficult over time as the project evolves, where
some beam integrations with spark/flink may not work as expected.
- Also, as pointed out above, we probably need to support the hoodie-spark
module as first-class.

Thanks,
Nishith


On Tue, Aug 6, 2019 at 9:48 AM taher koitawala  wrote:

> Hi Vinoth,
> Are there some tasks I can take up to ramp up the code? Want to get
> more used to the code and understand the existing implementation better.
>
> Thanks,
> Taher Koitawala
>
> On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar  wrote:
>
> > Let's see if others have any thoughts as well. We can plan to fix the
> > approach by EOW.
> >
> > On Mon, Aug 5, 2019 at 7:06 PM vino yang  wrote:
> >
> > > Hi guys,
> > >
> > > Also, +1 for Approach 1 like Taher.
> > >
> > > > If we can do a comprehensive analysis of this model and come up with.
> > > means
> > > > to refactor this cleanly, this would be promising.
> > >
> > > Yes, when we get the conclusion, we could start this work.
> > >
> > > Best,
> > > Vino
> > >
> > >
> > > taher koitawala  于2019年8月6日周二 上午12:28写道:
> > >
> > > > +1 for Approch 1 Point integration with each framework
> > > >
> > > > Approach 2 has a problem as you said "Developers need to think about
> > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> this
> > > may
> > > > not be the panacea that it seems to be"
> > > >
> > > > We have seen various pipelines in the beam dag being expressed
> > > differently
> > > > then we had them in our original usecase. And also switching between
> > > spark
> > > > and Flink runners in beam have various impact on the pipelines like
> > some
> > > > features available in Flink are not available on the spark runner
> etc.
> > > > Refer to this compatible matrix ->
> > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > >
> > > > Hence my vote on Approch 1 let's decouple and build the abstract for
> > each
> > > > framework. That is a much better option. We will also have more
> control
> > > > over each framework's implement.
> > > >
> > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar 
> wrote:
> > > >
> > > > > Would like to highlight that there are two distinct approaches here
> > > with
> > > > > different tradeoffs. Think of this as my braindump, as I have been
> > > > thinking
> > > > > about this quite a bit in the past.
> > > > >
> > > > >
> > > > > *Approach 1 : Point integration with each framework *
> > > > >
> > > > > >>We may need a pure client module named for example
> > > > > hoodie-client-core(common)
> > > > > >> Then we could have: hoodie-client-spark, hoodie-client-flink
> > > > > and hoodie-client-beam
> > > > >
> > > > > (+) This is the safest to do IMO, since we can isolate the current
> > > Spark
> > > > > execution (hoodie-spark, hoodie-client-spark) from the changes for
> > > flink,
> > > > > while it stabilizes over few releases.
> > > > > (-) Downside is that the utilities needs to be redone :
> > > > >  hoodie-utilities-spark and hoodie-utilities-flink and
> > > > > hoodie-utilities-core ? hoodie-cli?
> > > > >
> > > > > If we can do a comprehensive analysis of this model and come up
> with.
> > > > means
> > > > > to refactor this cleanly, this would be promising.
> > > > 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread vbal...@apache.org
 
+1 on approach 1. As pointed out, approach 2 has a risk of performance
regression when introducing the beam abstraction. To keep things simpler and start
iterating, we can try an incremental route where beam can be thought of as another
engine supporting Hudi. When there is material confidence that there won't be
any performance regression with Beam, we can start the unification effort. This
sounds like the more practical approach too, as the folks who have volunteered to
spearhead this project have Flink experience.
Vino/Taher/Vinay,
Being new to Flink, I would start with identifying whether there are any high-level
differences in guarantees and programming model between Spark and Flink.
You can do a parallel investigation while we are deciding on the module
structure. You could look at all the patterns in Hudi's Spark API usage
(RDD/DataSource/SparkContext) and see whether such support can be achieved in theory
with Flink. If not, what is the workaround? Documenting such patterns would be
valuable when multiple engineers are working on it. For e.g., Hudi relies on
(a) custom partitioning logic for upserts, (b) caching RDDs to avoid reruns
of costly stages, and (c) a Spark upsert task knowing its spark
partition/task/attempt ids.
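
To make (a)-(c) concrete for whoever picks up the Flink comparison, below is a
minimal, self-contained Spark (Java) sketch of those three patterns. It is
illustrative only (not code taken from Hudi, and the class name is invented), but
each call it uses (partitionBy with a custom Partitioner, persist, TaskContext)
is the actual Spark API in question.

import java.util.Arrays;

import org.apache.spark.Partitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

// Illustrative only: the three Spark usage patterns listed above, in miniature.
public class SparkPatternsSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setMaster("local[2]").setAppName("hudi-spark-patterns"));

    JavaPairRDD<String, Integer> records = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("fileGroup-1", 1), new Tuple2<>("fileGroup-2", 2)));

    // (a) custom partitioning logic for upserts: route each key to a chosen bucket
    JavaPairRDD<String, Integer> partitioned = records.partitionBy(new Partitioner() {
      @Override public int numPartitions() { return 2; }
      @Override public int getPartition(Object key) {
        return Math.abs(key.hashCode()) % numPartitions();
      }
    });

    // (b) caching to avoid re-running costly stages (index lookup, workload profile)
    partitioned.persist(StorageLevel.MEMORY_AND_DISK());

    // (c) a task inspecting its own partition/task/attempt ids
    partitioned.foreach(t -> {
      TaskContext tc = TaskContext.get();
      System.out.println(t._1() + " handled by partition=" + tc.partitionId()
          + " taskAttemptId=" + tc.taskAttemptId() + " attempt=" + tc.attemptNumber());
    });

    jsc.stop();
  }
}

The Flink investigation would then ask what the natural equivalent of each of
these is (custom key selectors/partitioners, state or operator-level caching,
RuntimeContext), or what the workaround is if there is none.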

Balaji.V
On Monday, August 5, 2019, 08:59:04 AM PDT, Vinoth Chandar 
 wrote:  
 
 Would like to highlight that there are two distinct approaches here with
different tradeoffs. Think of this as my braindump, as I have been thinking
about this quite a bit in the past.


*Approach 1 : Point integration with each framework *

>>We may need a pure client module named for example
hoodie-client-core(common)
>> Then we could have: hoodie-client-spark, hoodie-client-flink
and hoodie-client-beam

(+) This is the safest to do IMO, since we can isolate the current Spark
execution (hoodie-spark, hoodie-client-spark) from the changes for flink,
while it stabilizes over few releases.
(-) Downside is that the utilities needs to be redone :
 hoodie-utilities-spark and hoodie-utilities-flink and
hoodie-utilities-core ? hoodie-cli?

If we can do a comprehensive analysis of this model and come up with. means
to refactor this cleanly, this would be promising.


*Approach 2: Beam as the compute abstraction*

Another more drastic approach is to remove Spark as the compute abstraction
for writing data and replace it with Beam.

(+) All of the code remains more or less similar and there is one compute
API to reason about.

(-) The (very big) assumption here is that we are able to tune the spark
runtime the same way using Beam : custom partitioners, support for all RDD
operations we invoke, caching etc etc.
(-) It will be a massive rewrite and testing of such a large rewrite would
also be really challenging, since we need to pay attention to all intricate
details to ensure the spark users today experience no
regressions/side-effects
(-) Note that we still need to probably support the hoodie-spark module and
may be a first-class such integration with flink, for native flink/spark
pipeline authoring. Users of say DeltaStreamer need to pass in Spark or
Flink configs anyway..  Developers need to think about
what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may
not be the panacea that it seems to be.



One goal for the HIP is to get us all to agree as a community which one to
pick, with sufficient investigation, testing, benchmarking..

On Sat, Aug 3, 2019 at 7:56 PM vino yang  wrote:

> +1 for both Beam and Flink
>
> > First step here is to probably draw out current hierrarchy and figure out
> > what the abstraction points are..
> > In my opinion, the runtime (spark, flink) should be done at the
> > hoodie-client level and just used by hoodie-utilties seamlessly..
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter we hope Hudi to integrate with which computing framework.
> We need to decouple Hudi client and Spark.
>
> We may need a pure client module named for example
> hoodie-client-core(common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
>
> Suneel Marthi  于2019年8月4日周日 上午10:45写道:
>
> > +1 for Beam -- agree with Semantic Beeng's analysis.
> >
> > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala 
> > wrote:
> >
> > > So the way to go around this is that file a hip. Chalk all th classes
> our
> > > and start moving towards Pure client.
> > >
> > > Secondly should we want to try beam?
> > >
> > > I think there is to much going on here and I'm not able to follow. If
> we
> > > want to try out beam all along I don't think it makes sense to do
> > anything
> > > on Flink then.
> > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng 
> > > wrote:
> > >
> > >> +1 My money is on this approach.
> > >>
> > >> The existing abstractions from Beam seem enough for the use cases as I
> > >> imagine them.
> > >>
> > >> Flink also has "dynamic table", "table source" and "table sink" which
> > >> seem very useful abstractions 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread taher koitawala
Hi Vinoth,
Are there some tasks I can take up to ramp up the code? Want to get
more used to the code and understand the existing implementation better.

Thanks,
Taher Koitawala

On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar  wrote:

> Let's see if others have any thoughts as well. We can plan to fix the
> approach by EOW.
>
> On Mon, Aug 5, 2019 at 7:06 PM vino yang  wrote:
>
> > Hi guys,
> >
> > Also, +1 for Approach 1 like Taher.
> >
> > > If we can do a comprehensive analysis of this model and come up with.
> > means
> > > to refactor this cleanly, this would be promising.
> >
> > Yes, when we get the conclusion, we could start this work.
> >
> > Best,
> > Vino
> >
> >
> > taher koitawala  于2019年8月6日周二 上午12:28写道:
> >
> > > +1 for Approch 1 Point integration with each framework
> > >
> > > Approach 2 has a problem as you said "Developers need to think about
> > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this
> > may
> > > not be the panacea that it seems to be"
> > >
> > > We have seen various pipelines in the beam dag being expressed
> > differently
> > > then we had them in our original usecase. And also switching between
> > spark
> > > and Flink runners in beam have various impact on the pipelines like
> some
> > > features available in Flink are not available on the spark runner etc.
> > > Refer to this compatible matrix ->
> > > https://beam.apache.org/documentation/runners/capability-matrix/
> > >
> > > Hence my vote on Approch 1 let's decouple and build the abstract for
> each
> > > framework. That is a much better option. We will also have more control
> > > over each framework's implement.
> > >
> > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar  wrote:
> > >
> > > > Would like to highlight that there are two distinct approaches here
> > with
> > > > different tradeoffs. Think of this as my braindump, as I have been
> > > thinking
> > > > about this quite a bit in the past.
> > > >
> > > >
> > > > *Approach 1 : Point integration with each framework *
> > > >
> > > > >>We may need a pure client module named for example
> > > > hoodie-client-core(common)
> > > > >> Then we could have: hoodie-client-spark, hoodie-client-flink
> > > > and hoodie-client-beam
> > > >
> > > > (+) This is the safest to do IMO, since we can isolate the current
> > Spark
> > > > execution (hoodie-spark, hoodie-client-spark) from the changes for
> > flink,
> > > > while it stabilizes over few releases.
> > > > (-) Downside is that the utilities needs to be redone :
> > > >  hoodie-utilities-spark and hoodie-utilities-flink and
> > > > hoodie-utilities-core ? hoodie-cli?
> > > >
> > > > If we can do a comprehensive analysis of this model and come up with.
> > > means
> > > > to refactor this cleanly, this would be promising.
> > > >
> > > >
> > > > *Approach 2: Beam as the compute abstraction*
> > > >
> > > > Another more drastic approach is to remove Spark as the compute
> > > abstraction
> > > > for writing data and replace it with Beam.
> > > >
> > > > (+) All of the code remains more or less similar and there is one
> > compute
> > > > API to reason about.
> > > >
> > > > (-) The (very big) assumption here is that we are able to tune the
> > spark
> > > > runtime the same way using Beam : custom partitioners, support for
> all
> > > RDD
> > > > operations we invoke, caching etc etc.
> > > > (-) It will be a massive rewrite and testing of such a large rewrite
> > > would
> > > > also be really challenging, since we need to pay attention to all
> > > intricate
> > > > details to ensure the spark users today experience no
> > > > regressions/side-effects
> > > > (-) Note that we still need to probably support the hoodie-spark
> module
> > > and
> > > > may be a first-class such integration with flink, for native
> > flink/spark
> > > > pipeline authoring. Users of say DeltaStreamer need to pass in Spark
> or
> > > > Flink configs anyway..  Developers need to think about
> > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> this
> > > may
> > > > not be the panacea that it seems to be.
> > > >
> > > >
> > > >
> > > > One goal for the HIP is to get us all to agree as a community which
> one
> > > to
> > > > pick, with sufficient investigation, testing, benchmarking..
> > > >
> > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang 
> > wrote:
> > > >
> > > > > +1 for both Beam and Flink
> > > > >
> > > > > > First step here is to probably draw out current hierrarchy and
> > figure
> > > > out
> > > > > > what the abstraction points are..
> > > > > > In my opinion, the runtime (spark, flink) should be done at the
> > > > > > hoodie-client level and just used by hoodie-utilties seamlessly..
> > > > >
> > > > > +1 for Vinoth's opinion, it should be the first step.
> > > > >
> > > > > No matter we hope Hudi to integrate with which computing framework.
> > > > > We need to decouple Hudi client and Spark.
> > > > >
> > > > > We may need a pure client module named for example
> > > > > 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread Vinoth Chandar
Let's see if others have any thoughts as well. We can plan to fix the
approach by EOW.

On Mon, Aug 5, 2019 at 7:06 PM vino yang  wrote:

> Hi guys,
>
> Also, +1 for Approach 1 like Taher.
>
> > If we can do a comprehensive analysis of this model and come up with.
> means
> > to refactor this cleanly, this would be promising.
>
> Yes, when we get the conclusion, we could start this work.
>
> Best,
> Vino
>
>
> taher koitawala  于2019年8月6日周二 上午12:28写道:
>
> > +1 for Approch 1 Point integration with each framework
> >
> > Approach 2 has a problem as you said "Developers need to think about
> > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this
> may
> > not be the panacea that it seems to be"
> >
> > We have seen various pipelines in the beam dag being expressed
> differently
> > then we had them in our original usecase. And also switching between
> spark
> > and Flink runners in beam have various impact on the pipelines like some
> > features available in Flink are not available on the spark runner etc.
> > Refer to this compatible matrix ->
> > https://beam.apache.org/documentation/runners/capability-matrix/
> >
> > Hence my vote on Approch 1 let's decouple and build the abstract for each
> > framework. That is a much better option. We will also have more control
> > over each framework's implement.
> >
> > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar  wrote:
> >
> > > Would like to highlight that there are two distinct approaches here
> with
> > > different tradeoffs. Think of this as my braindump, as I have been
> > thinking
> > > about this quite a bit in the past.
> > >
> > >
> > > *Approach 1 : Point integration with each framework *
> > >
> > > >>We may need a pure client module named for example
> > > hoodie-client-core(common)
> > > >> Then we could have: hoodie-client-spark, hoodie-client-flink
> > > and hoodie-client-beam
> > >
> > > (+) This is the safest to do IMO, since we can isolate the current
> Spark
> > > execution (hoodie-spark, hoodie-client-spark) from the changes for
> flink,
> > > while it stabilizes over few releases.
> > > (-) Downside is that the utilities needs to be redone :
> > >  hoodie-utilities-spark and hoodie-utilities-flink and
> > > hoodie-utilities-core ? hoodie-cli?
> > >
> > > If we can do a comprehensive analysis of this model and come up with.
> > means
> > > to refactor this cleanly, this would be promising.
> > >
> > >
> > > *Approach 2: Beam as the compute abstraction*
> > >
> > > Another more drastic approach is to remove Spark as the compute
> > abstraction
> > > for writing data and replace it with Beam.
> > >
> > > (+) All of the code remains more or less similar and there is one
> compute
> > > API to reason about.
> > >
> > > (-) The (very big) assumption here is that we are able to tune the
> spark
> > > runtime the same way using Beam : custom partitioners, support for all
> > RDD
> > > operations we invoke, caching etc etc.
> > > (-) It will be a massive rewrite and testing of such a large rewrite
> > would
> > > also be really challenging, since we need to pay attention to all
> > intricate
> > > details to ensure the spark users today experience no
> > > regressions/side-effects
> > > (-) Note that we still need to probably support the hoodie-spark module
> > and
> > > may be a first-class such integration with flink, for native
> flink/spark
> > > pipeline authoring. Users of say DeltaStreamer need to pass in Spark or
> > > Flink configs anyway..  Developers need to think about
> > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this
> > may
> > > not be the panacea that it seems to be.
> > >
> > >
> > >
> > > One goal for the HIP is to get us all to agree as a community which one
> > to
> > > pick, with sufficient investigation, testing, benchmarking..
> > >
> > > On Sat, Aug 3, 2019 at 7:56 PM vino yang 
> wrote:
> > >
> > > > +1 for both Beam and Flink
> > > >
> > > > > First step here is to probably draw out current hierrarchy and
> figure
> > > out
> > > > > what the abstraction points are..
> > > > > In my opinion, the runtime (spark, flink) should be done at the
> > > > > hoodie-client level and just used by hoodie-utilties seamlessly..
> > > >
> > > > +1 for Vinoth's opinion, it should be the first step.
> > > >
> > > > No matter we hope Hudi to integrate with which computing framework.
> > > > We need to decouple Hudi client and Spark.
> > > >
> > > > We may need a pure client module named for example
> > > > hoodie-client-core(common)
> > > >
> > > > Then we could have: hoodie-client-spark, hoodie-client-flink and
> > > > hoodie-client-beam
> > > >
> > > > Suneel Marthi  于2019年8月4日周日 上午10:45写道:
> > > >
> > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > > >
> > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <
> taher...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > So the way to go around this is that file a hip. Chalk all th
> > classes
> > > > our
> > > > > > and start 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-05 Thread taher koitawala
+1 for Approach 1: Point integration with each framework

Approach 2 has a problem as you said "Developers need to think about
what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may
not be the panacea that it seems to be"

We have seen various pipelines in the beam dag being expressed differently
than we had them in our original use case. Also, switching between the spark
and Flink runners in beam has various impacts on the pipelines; for example, some
features available in Flink are not available on the spark runner.
Refer to this compatibility matrix ->
https://beam.apache.org/documentation/runners/capability-matrix/

Hence my vote is for Approach 1: let's decouple and build the abstraction for each
framework. That is a much better option. We will also have more control
over each framework's implementation.

On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar  wrote:

> Would like to highlight that there are two distinct approaches here with
> different tradeoffs. Think of this as my braindump, as I have been thinking
> about this quite a bit in the past.
>
>
> *Approach 1 : Point integration with each framework *
>
> >>We may need a pure client module named for example
> hoodie-client-core(common)
> >> Then we could have: hoodie-client-spark, hoodie-client-flink
> and hoodie-client-beam
>
> (+) This is the safest to do IMO, since we can isolate the current Spark
> execution (hoodie-spark, hoodie-client-spark) from the changes for flink,
> while it stabilizes over few releases.
> (-) Downside is that the utilities needs to be redone :
>  hoodie-utilities-spark and hoodie-utilities-flink and
> hoodie-utilities-core ? hoodie-cli?
>
> If we can do a comprehensive analysis of this model and come up with. means
> to refactor this cleanly, this would be promising.
>
>
> *Approach 2: Beam as the compute abstraction*
>
> Another more drastic approach is to remove Spark as the compute abstraction
> for writing data and replace it with Beam.
>
> (+) All of the code remains more or less similar and there is one compute
> API to reason about.
>
> (-) The (very big) assumption here is that we are able to tune the spark
> runtime the same way using Beam : custom partitioners, support for all RDD
> operations we invoke, caching etc etc.
> (-) It will be a massive rewrite and testing of such a large rewrite would
> also be really challenging, since we need to pay attention to all intricate
> details to ensure the spark users today experience no
> regressions/side-effects
> (-) Note that we still need to probably support the hoodie-spark module and
> may be a first-class such integration with flink, for native flink/spark
> pipeline authoring. Users of say DeltaStreamer need to pass in Spark or
> Flink configs anyway..  Developers need to think about
> what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may
> not be the panacea that it seems to be.
>
>
>
> One goal for the HIP is to get us all to agree as a community which one to
> pick, with sufficient investigation, testing, benchmarking..
>
> On Sat, Aug 3, 2019 at 7:56 PM vino yang  wrote:
>
> > +1 for both Beam and Flink
> >
> > > First step here is to probably draw out current hierrarchy and figure
> out
> > > what the abstraction points are..
> > > In my opinion, the runtime (spark, flink) should be done at the
> > > hoodie-client level and just used by hoodie-utilties seamlessly..
> >
> > +1 for Vinoth's opinion, it should be the first step.
> >
> > No matter we hope Hudi to integrate with which computing framework.
> > We need to decouple Hudi client and Spark.
> >
> > We may need a pure client module named for example
> > hoodie-client-core(common)
> >
> > Then we could have: hoodie-client-spark, hoodie-client-flink and
> > hoodie-client-beam
> >
> > Suneel Marthi  于2019年8月4日周日 上午10:45写道:
> >
> > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > >
> > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala 
> > > wrote:
> > >
> > > > So the way to go around this is that file a hip. Chalk all th classes
> > our
> > > > and start moving towards Pure client.
> > > >
> > > > Secondly should we want to try beam?
> > > >
> > > > I think there is to much going on here and I'm not able to follow. If
> > we
> > > > want to try out beam all along I don't think it makes sense to do
> > > anything
> > > > on Flink then.
> > > >
> > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng 
> > > > wrote:
> > > >
> > > >> +1 My money is on this approach.
> > > >>
> > > >> The existing abstractions from Beam seem enough for the use cases
> as I
> > > >> imagine them.
> > > >>
> > > >> Flink also has "dynamic table", "table source" and "table sink"
> which
> > > >> seem very useful abstractions where Hudi might fit nicely.
> > > >>
> > > >>
> > > >>
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > >>
> > > >>
> > > >> Attached a screen shot.
> > > >>
> > > >> This seems to fit with the original premise of Hudi as 

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-05 Thread Vinoth Chandar
Great discussions! Responded on the original thread on decoupling.
Let's continue there?

On Mon, Aug 5, 2019 at 1:39 AM Semantic Beeng 
wrote:

> "design is more important. When we have a clear idea, it is not too late
> to create an issue"
>
> 100% with Vino
>
>
> On August 5, 2019 at 2:50 AM taher koitawala  wrote:
>
> Sounds good. Let's do that first.
>
> On Mon, Aug 5, 2019, 11:59 AM vino yang < yanghua1...@gmail.com> wrote:
>
> Hi Taher,
>
> IMO, Let's listen to more comments, after all, this discussion took place
> over the weekend. Then listen to Vinoth and the community's comments and
> suggestions.
>
> I personally think that design is more important. When we have a clear
> idea, it is not too late to create an issue.
>
> I am sorting out classes that depend on Spark. Maybe we can discuss how to
> decouple.
>
> What do you think?
>
> Best,
> Vino
>
> taher koitawala < taher...@gmail.com> 于2019年8月5日周一 下午2:17写道:
>
> If everyone agrees that we should decouple Hudi and Spark to enable
> processing engine abstraction. Should I open a jira ticket for that?
>
> On Sun, Aug 4, 2019 at 6:59 PM taher koitawala < taher...@gmail.com>
> wrote:
>
> If anyone wants to see a Flink Streaming pipeline here is a really small
> and basic Flink pipeline.
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
>
> Consider users playing a game across multiple platforms and we only get
> the timestamp, username and the current score as the record. The pipelines
> has a custom source function which produces this stream record.
>
> The pipeline does aggregations(Sum score of current window with the total
> score of the user) every 2 seconds based on the event time attached with
> the record.
>
> User's score keeps increasing as new windows are fired and new outputs are
> emitted. That's where Hudi fits as per my vision now, where Hudi
> intelligently shows only the latest records written.
>
>
>
> On Sun, Aug 4, 2019, 6:43 PM taher koitawala < taher...@gmail.com> wrote:
>
> Fully agreed with Vino. I think let's chalk out the classes. Make
> hierarchies and start decoupling everything. Then we can move forward with
> the Flink and Beam streaming components.
>
> On Sun, Aug 4, 2019, 1:52 PM vino yang < yanghua1...@gmail.com> wrote:
>
> Hi Nick,
>
> Thank you for your more detailed thoughts, and I fully agree with your
> thoughts about HudiLink, which should also be part of the long-term
> planning of the Hudi Ecology.
>
>
> *But I found that the angle of our thinking and the starting point are not
> consistent. I pay more attention to the rationality of the existing
> architecture and whether the dependence on the computing engine is
> pluggable. Don't get me wrong, I know very well that although we have
> different perspectives, these views have value for Hudi.*
> Let me give more details on the discussion I made earlier.
>
> Currently, multiple submodules of the Hudi project are tightly coupled to
> Spark's design and dependencies. You can see that many of the class files
> contain statements such as "import org.apache.spark.xxx".
>
> I first put forward a discussion: "Integrate Hudi with Apache Flink", and
> then came up with a discussion: "Decouple Hudi and Spark".
>
> I think the word "Integrate" I used for the first discussion may not be
> accurate enough. My intention is to make the computing engine used by Hudi
> pluggable. Spark is equivalent to Hudi is just a library, it is not the
> core of Hudi, it should not be strongly coupled with Hudi. The features
> currently provided by Spark are also available from Flink. But in order to
> achieve this, we need to decouple Hudi from the code level with the use of
> Spark.
>
> This makes sense both in terms of structural rationality and community
> ecology.
>
> Best,
> Vino
>
>
> Semantic Beeng < n...@semanticbeeng.com> 于2019年8月4日周日 下午2:21写道:
>
> "+1 for both Beam and Flink" - what I propose implies this indeed.
>
> But/and am working from the desired functionality and a proposed design.
>
> (as opposed to starting with refactoring Hudi with the goal of close
> integration with Flink)
>
> I feel this is not necessary - but am not an expert in Hudi implementation.
>
> But am pretty sure it is not sufficient for the use cases I have in mind.
> The gist is using Hudi as a file based data lake + ML feature store that
> enables incremental analyses done with a combination of Flink, Beam, Spark,
> Tensorlflow (see Petastorm from UberEng for an idea.)
>
> Let us call this HudiLink from now on (think of it as a mediator, not
> another Hudi).
>
> The intuition behind looking at more then Flink is that both Beam and
> Flink have good design abstractions we might reuse and extend.
>
> Like I said before, do not believe in point to point integrations.
>
> Alternatively / in parallel,If you care to share your use cases it would
> be very useful. Working with explicit use cases helps others to relate and
> help.
>
> Also, if some of 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-05 Thread Vinoth Chandar
Would like to highlight that there are two distinct approaches here with
different tradeoffs. Think of this as my braindump, as I have been thinking
about this quite a bit in the past.


*Approach 1 : Point integration with each framework *

>>We may need a pure client module named for example
hoodie-client-core(common)
>> Then we could have: hoodie-client-spark, hoodie-client-flink
and hoodie-client-beam

(+) This is the safest to do IMO, since we can isolate the current Spark
execution (hoodie-spark, hoodie-client-spark) from the changes for flink,
while it stabilizes over few releases.
(-) Downside is that the utilities need to be redone:
hoodie-utilities-spark, hoodie-utilities-flink, and
hoodie-utilities-core? hoodie-cli?

If we can do a comprehensive analysis of this model and come up with means
to refactor this cleanly, this would be promising.
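
To make Approach 1 a bit more tangible, below is a rough, hypothetical Java
sketch of what a hoodie-client-core abstraction could look like, with the
engine-specific pieces pushed behind an interface. None of these types
(EngineContext, SparkEngineContext, WriteClientCore) exist today; they are
only an illustration of the module split being discussed, not a design.

import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch only -- none of these types exist in Hudi today.
// Idea: hoodie-client-core codes against a narrow engine interface, and
// hoodie-client-spark / hoodie-client-flink / hoodie-client-beam supply implementations.
interface EngineContext {
  // Engine-agnostic "distributed map"; Spark would back this with RDDs, Flink with DataStream/DataSet.
  <I, O> List<O> map(List<I> input, Function<I, O> fn, int parallelism);

  // Spark: RDD.persist(); Flink: would need a different mechanism (one of the open questions).
  <I> void persist(List<I> data);
}

// Would live in hoodie-client-spark and wrap a JavaSparkContext; local stand-in here.
class SparkEngineContext implements EngineContext {
  @Override
  public <I, O> List<O> map(List<I> input, Function<I, O> fn, int parallelism) {
    return input.stream().map(fn).collect(Collectors.toList()); // placeholder for a real Spark job
  }

  @Override
  public <I> void persist(List<I> data) { /* delegate to RDD caching in the real module */ }
}

// Engine-agnostic write path in hoodie-client-core: index, profile, write -- all via EngineContext.
class WriteClientCore {
  private final EngineContext engine;

  WriteClientCore(EngineContext engine) { this.engine = engine; }

  List<String> upsert(List<String> records) {
    engine.persist(records); // e.g. before a second pass for the workload profile
    return engine.map(records, r -> "written:" + r, 4);
  }
}

public class EngineAbstractionSketch {
  public static void main(String[] args) {
    WriteClientCore client = new WriteClientCore(new SparkEngineContext());
    System.out.println(client.upsert(Arrays.asList("r1", "r2")));
  }
}

The real interface would of course have to cover the harder pieces too (custom
partitioning, caching, task-context lookups), which is exactly what the
comprehensive analysis needs to enumerate.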


*Approach 2: Beam as the compute abstraction*

Another more drastic approach is to remove Spark as the compute abstraction
for writing data and replace it with Beam.

(+) All of the code remains more or less similar and there is one compute
API to reason about.

(-) The (very big) assumption here is that we are able to tune the spark
runtime the same way using Beam: custom partitioners, support for all the RDD
operations we invoke, caching, etc.
(-) It will be a massive rewrite, and testing such a large rewrite would
also be really challenging, since we need to pay attention to all the intricate
details to ensure that today's spark users experience no
regressions/side-effects.
(-) Note that we probably still need to support the hoodie-spark module, and
maybe a similar first-class integration with flink, for native flink/spark
pipeline authoring. Users of, say, DeltaStreamer need to pass in Spark or
Flink configs anyway. Developers need to think about
what-if-this-piece-of-code-ran-as-spark-vs-flink. So in the end, this may
not be the panacea that it seems to be.
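
For anyone evaluating Approach 2, here is a small, self-contained Beam (Java)
sketch of what one keyed shuffle step might look like if expressed against the
Beam model. It is purely illustrative (class and step names are invented), and
it also shows the gap called out above: there is no direct analogue of a custom
Spark Partitioner or RDD.persist() in this code.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Illustrative only: a "group records by target file group" shuffle expressed in Beam.
public class BeamShuffleSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Pretend input records of the form "fileGroupId:payload".
    PCollection<String> records =
        p.apply("ReadRecords", Create.of(Arrays.asList("fg-1:a", "fg-1:b", "fg-2:c")));

    // Key each record by the file group it should be written to.
    PCollection<KV<String, String>> keyed = records.apply("KeyByFileGroup",
        MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
            .via((String s) -> KV.of(s.split(":")[0], s.split(":")[1])));

    // The runner decides how this shuffle is physically partitioned; unlike Spark's
    // partitionBy(customPartitioner), there is no direct hook here. There is also no
    // RDD-style persist()/cache() for reusing this result across two passes.
    PCollection<KV<String, Iterable<String>>> grouped =
        keyed.apply("GroupByFileGroup", GroupByKey.create());

    // A real write step (create/merge file slices per group) would consume `grouped` here.
    p.run().waitUntilFinish();
  }
}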



One goal for the HIP is to get us all to agree as a community on which one to
pick, with sufficient investigation, testing, and benchmarking.

On Sat, Aug 3, 2019 at 7:56 PM vino yang  wrote:

> +1 for both Beam and Flink
>
> > First step here is to probably draw out current hierrarchy and figure out
> > what the abstraction points are..
> > In my opinion, the runtime (spark, flink) should be done at the
> > hoodie-client level and just used by hoodie-utilties seamlessly..
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter we hope Hudi to integrate with which computing framework.
> We need to decouple Hudi client and Spark.
>
> We may need a pure client module named for example
> hoodie-client-core(common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
>
> Suneel Marthi  于2019年8月4日周日 上午10:45写道:
>
> > +1 for Beam -- agree with Semantic Beeng's analysis.
> >
> > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala 
> > wrote:
> >
> > > So the way to go around this is that file a hip. Chalk all th classes
> our
> > > and start moving towards Pure client.
> > >
> > > Secondly should we want to try beam?
> > >
> > > I think there is to much going on here and I'm not able to follow. If
> we
> > > want to try out beam all along I don't think it makes sense to do
> > anything
> > > on Flink then.
> > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng 
> > > wrote:
> > >
> > >> +1 My money is on this approach.
> > >>
> > >> The existing abstractions from Beam seem enough for the use cases as I
> > >> imagine them.
> > >>
> > >> Flink also has "dynamic table", "table source" and "table sink" which
> > >> seem very useful abstractions where Hudi might fit nicely.
> > >>
> > >>
> > >>
> >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > >>
> > >>
> > >> Attached a screen shot.
> > >>
> > >> This seems to fit with the original premise of Hudi as well.
> > >>
> > >> Am exploring this venue with a use case that involves "temporal joins
> on
> > >> streams" which I need for feature extraction.
> > >>
> > >> Anyone is interested in this or has concrete enough needs and use
> cases
> > >> please let me know.
> > >>
> > >> Best to go from an agreed upon set of 2-3 use cases.
> > >>
> > >> Cheers
> > >>
> > >> Nick
> > >>
> > >>
> > >> > Also, we do have some Beam experts on the mailing list.. Can you
> > please
> > >> weigh on viability of using Beam as the intermediate abstraction here
> > >> between Spark/Flink?
> > >> Hudi uses RDD apis like groupBy, mapToPair, sortAndRepartition,
> > >> reduceByKey, countByKey and also does custom partitioning a lot.>
> > >>
> > >> >
> > >>
> > >
> >
>


Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-05 Thread vino yang
Hi Taher,

IMO, let's listen to more comments; after all, this discussion took place
over the weekend. Then we can take in Vinoth's and the community's comments and
suggestions.

I personally think that design is more important. When we have a clear
idea, it is not too late to create an issue.

I am sorting out the classes that depend on Spark. Maybe we can then discuss how
to decouple them.

What do you think?

Best,
Vino

taher koitawala  于2019年8月5日周一 下午2:17写道:

> If everyone agrees that we should decouple Hudi and Spark to enable
> processing engine abstraction. Should I open a jira ticket for that?
>
> On Sun, Aug 4, 2019 at 6:59 PM taher koitawala  wrote:
>
>> If anyone wants to see a Flink Streaming pipeline here is a really small
>> and basic Flink pipeline.
>> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
>>
>> Consider users playing a game across multiple platforms and we only get
>> the timestamp, username and the current score as the record. The pipelines
>> has a custom source function which produces this stream record.
>>
>> The pipeline does aggregations(Sum score of current window with the total
>> score of the user) every 2 seconds based on the event time attached with
>> the record.
>>
>> User's score keeps increasing as new windows are fired and new outputs
>> are emitted. That's where Hudi fits as per my vision now, where Hudi
>> intelligently shows only the latest records written.
>>
>>
>>
>> On Sun, Aug 4, 2019, 6:43 PM taher koitawala  wrote:
>>
>>> Fully agreed with Vino. I think let's chalk out the classes. Make
>>> hierarchies and start decoupling everything. Then we can move forward with
>>> the Flink and Beam streaming components.
>>>
>>> On Sun, Aug 4, 2019, 1:52 PM vino yang  wrote:
>>>
 Hi Nick,

 Thank you for your more detailed thoughts, and I fully agree with your
 thoughts about HudiLink, which should also be part of the long-term
 planning of the Hudi Ecology.


 *But I found that the angle of our thinking and the starting point are
 not consistent. I pay more attention to the rationality of the existing
 architecture and whether the dependence on the computing engine is
 pluggable. Don't get me wrong, I know very well that although we have
 different perspectives, these views have value for Hudi.*
 Let me give more details on the discussion I made earlier.

 Currently, multiple submodules of the Hudi project are tightly coupled
 to Spark's design and dependencies. You can see that many of the class
 files contain statements such as "import org.apache.spark.xxx".

 I first put forward a discussion: "Integrate Hudi with Apache Flink",
 and then came up with a discussion: "Decouple Hudi and Spark".

 I think the word "Integrate" I used for the first discussion may not be
 accurate enough. My intention is to make the computing engine used by Hudi
 pluggable. Spark is equivalent to Hudi is just a library, it is not the
 core of Hudi, it should not be strongly coupled with Hudi. The features
 currently provided by Spark are also available from Flink. But in order to
 achieve this, we need to decouple Hudi from the code level with the use of
 Spark.

 This makes sense both in terms of structural rationality and community
 ecology.

 Best,
 Vino


 Semantic Beeng  于2019年8月4日周日 下午2:21写道:

> "+1 for both Beam and Flink" - what I propose implies this indeed.
>
> But/and am working from the desired functionality and a proposed
> design.
>
> (as opposed to starting with refactoring Hudi with the goal of close
> integration with Flink)
>
> I feel this is not necessary - but am not an expert in Hudi
> implementation.
>
> But am pretty sure it is not sufficient for the use cases I have in
> mind. The gist is using Hudi as a file based data lake + ML feature store
> that enables incremental analyses done with a combination of Flink, Beam,
> Spark, Tensorlflow (see Petastorm from UberEng for an idea.)
>
> Let us call this HudiLink from now on (think of it as a mediator, not
> another Hudi).
>
> The intuition behind looking at more then Flink is that both Beam and
> Flink have good design abstractions we might reuse and extend.
>
> Like I said before, do not believe in point to point integrations.
>
> Alternatively / in parallel,If you care to share your use cases it
> would be very useful. Working with explicit use cases helps others to
> relate and help.
>
> Also, if some of you know there believe in (see) value of refactoring
> Hudi implementation for a hard integration with Flink (but have no time to
> argue for it) ofc you please go ahead.
>
> That may be a valid bottom up approach but I cannot relate to it
> myself (due to lack of use cases).
>
> Working 

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-05 Thread taher koitawala
If everyone agrees that we should decouple Hudi and Spark to enable
processing-engine abstraction, should I open a jira ticket for that?

On Sun, Aug 4, 2019 at 6:59 PM taher koitawala  wrote:

> If anyone wants to see a Flink Streaming pipeline here is a really small
> and basic Flink pipeline.
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
>
> Consider users playing a game across multiple platforms and we only get
> the timestamp, username and the current score as the record. The pipelines
> has a custom source function which produces this stream record.
>
> The pipeline does aggregations(Sum score of current window with the total
> score of the user) every 2 seconds based on the event time attached with
> the record.
>
> User's score keeps increasing as new windows are fired and new outputs are
> emitted. That's where Hudi fits as per my vision now, where Hudi
> intelligently shows only the latest records written.
>
>
>
> On Sun, Aug 4, 2019, 6:43 PM taher koitawala  wrote:
>
>> Fully agreed with Vino. I think let's chalk out the classes. Make
>> hierarchies and start decoupling everything. Then we can move forward with
>> the Flink and Beam streaming components.
>>
>> On Sun, Aug 4, 2019, 1:52 PM vino yang  wrote:
>>
>>> Hi Nick,
>>>
>>> Thank you for your more detailed thoughts, and I fully agree with your
>>> thoughts about HudiLink, which should also be part of the long-term
>>> planning of the Hudi Ecology.
>>>
>>>
>>> *But I found that the angle of our thinking and the starting point are
>>> not consistent. I pay more attention to the rationality of the existing
>>> architecture and whether the dependence on the computing engine is
>>> pluggable. Don't get me wrong, I know very well that although we have
>>> different perspectives, these views have value for Hudi.*
>>> Let me give more details on the discussion I made earlier.
>>>
>>> Currently, multiple submodules of the Hudi project are tightly coupled
>>> to Spark's design and dependencies. You can see that many of the class
>>> files contain statements such as "import org.apache.spark.xxx".
>>>
>>> I first put forward a discussion: "Integrate Hudi with Apache Flink",
>>> and then came up with a discussion: "Decouple Hudi and Spark".
>>>
>>> I think the word "Integrate" I used for the first discussion may not be
>>> accurate enough. My intention is to make the computing engine used by Hudi
>>> pluggable. Spark is equivalent to Hudi is just a library, it is not the
>>> core of Hudi, it should not be strongly coupled with Hudi. The features
>>> currently provided by Spark are also available from Flink. But in order to
>>> achieve this, we need to decouple Hudi from the code level with the use of
>>> Spark.
>>>
>>> This makes sense both in terms of structural rationality and community
>>> ecology.
>>>
>>> Best,
>>> Vino
>>>
>>>
>>> Semantic Beeng  于2019年8月4日周日 下午2:21写道:
>>>
 "+1 for both Beam and Flink" - what I propose implies this indeed.

 But/and am working from the desired functionality and a proposed design.

 (as opposed to starting with refactoring Hudi with the goal of close
 integration with Flink)

 I feel this is not necessary - but am not an expert in Hudi
 implementation.

 But am pretty sure it is not sufficient for the use cases I have in
 mind. The gist is using Hudi as a file based data lake + ML feature store
 that enables incremental analyses done with a combination of Flink, Beam,
 Spark, Tensorlflow (see Petastorm from UberEng for an idea.)

 Let us call this HudiLink from now on (think of it as a mediator, not
 another Hudi).

 The intuition behind looking at more then Flink is that both Beam and
 Flink have good design abstractions we might reuse and extend.

 Like I said before, do not believe in point to point integrations.

 Alternatively / in parallel,If you care to share your use cases it
 would be very useful. Working with explicit use cases helps others to
 relate and help.

 Also, if some of you know there believe in (see) value of refactoring
 Hudi implementation for a hard integration with Flink (but have no time to
 argue for it) ofc you please go ahead.

 That may be a valid bottom up approach but I cannot relate to it myself
 (due to lack of use cases).

 Working on a material on HudiLink - if any are interested I might
 publish when more mature.

 Hint: this was part of the inspiration
 https://eng.uber.com/michelangelo/

 One well thought use case will get you "in". :-) Kidding, ofc.

 Cheers

 Nick


 On August 3, 2019 at 10:55 PM vino yang  wrote:


 +1 for both Beam and Flink

 First step here is to probably draw out current hierrarchy and figure
 out
 what the abstraction points are..
 In my opinion, the runtime (spark, flink) 

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-04 Thread taher koitawala
If anyone wants to see a Flink Streaming pipeline here is a really small
and basic Flink pipeline.
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example

Consider users playing a game across multiple platforms, where we only get the
timestamp, username and the current score as the record. The pipeline has
a custom source function which produces this stream of records.

The pipeline does aggregations (summing the score of the current window into the
total score of the user) every 2 seconds, based on the event time attached to
the record.

The user's score keeps increasing as new windows fire and new outputs are
emitted. That's where Hudi fits as per my vision now: Hudi
intelligently shows only the latest records written.
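
For anyone who doesn't want to open the repo, a minimal sketch of that kind of
pipeline on the Flink 1.x DataStream API follows. It is illustrative only, not
the code in the linked repository, and it only does per-window sums (the
running-total part of the description is left out for brevity).

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Illustrative sketch of the game-score pipeline described above (Flink 1.x DataStream API).
public class ScorePipelineSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    env.addSource(new SourceFunction<Tuple3<Long, String, Integer>>() {
          private volatile boolean running = true;

          @Override
          public void run(SourceContext<Tuple3<Long, String, Integer>> ctx) throws Exception {
            long eventTime = 0;
            while (running) {
              // (timestamp, username, score) -- the record shape from the example
              ctx.collect(Tuple3.of(eventTime, "user-" + (eventTime % 3), 10));
              eventTime += 500;
              Thread.sleep(100);
            }
          }

          @Override
          public void cancel() { running = false; }
        })
        .assignTimestampsAndWatermarks(
            new AscendingTimestampExtractor<Tuple3<Long, String, Integer>>() {
              @Override
              public long extractAscendingTimestamp(Tuple3<Long, String, Integer> record) {
                return record.f0; // event time comes straight off the record
              }
            })
        .keyBy(1)                                              // key by username
        .window(TumblingEventTimeWindows.of(Time.seconds(2)))  // fire every 2s of event time
        .sum(2)                                                // per-window score per user
        .print();                                              // a Hudi sink would replace this

    env.execute("score-pipeline-sketch");
  }
}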



On Sun, Aug 4, 2019, 6:43 PM taher koitawala  wrote:

> Fully agreed with Vino. I think let's chalk out the classes. Make
> hierarchies and start decoupling everything. Then we can move forward with
> the Flink and Beam streaming components.
>
> On Sun, Aug 4, 2019, 1:52 PM vino yang  wrote:
>
>> Hi Nick,
>>
>> Thank you for your more detailed thoughts, and I fully agree with your
>> thoughts about HudiLink, which should also be part of the long-term
>> planning of the Hudi Ecology.
>>
>>
>> *But I found that the angle of our thinking and the starting point are
>> not consistent. I pay more attention to the rationality of the existing
>> architecture and whether the dependence on the computing engine is
>> pluggable. Don't get me wrong, I know very well that although we have
>> different perspectives, these views have value for Hudi.*
>> Let me give more details on the discussion I made earlier.
>>
>> Currently, multiple submodules of the Hudi project are tightly coupled to
>> Spark's design and dependencies. You can see that many of the class files
>> contain statements such as "import org.apache.spark.xxx".
>>
>> I first put forward a discussion: "Integrate Hudi with Apache Flink", and
>> then came up with a discussion: "Decouple Hudi and Spark".
>>
>> I think the word "Integrate" I used for the first discussion may not be
>> accurate enough. My intention is to make the computing engine used by Hudi
>> pluggable. Spark is equivalent to Hudi is just a library, it is not the
>> core of Hudi, it should not be strongly coupled with Hudi. The features
>> currently provided by Spark are also available from Flink. But in order to
>> achieve this, we need to decouple Hudi from the code level with the use of
>> Spark.
>>
>> This makes sense both in terms of structural rationality and community
>> ecology.
>>
>> Best,
>> Vino
>>
>>
>> Semantic Beeng  于2019年8月4日周日 下午2:21写道:
>>
>>> "+1 for both Beam and Flink" - what I propose implies this indeed.
>>>
>>> But/and am working from the desired functionality and a proposed design.
>>>
>>> (as opposed to starting with refactoring Hudi with the goal of close
>>> integration with Flink)
>>>
>>> I feel this is not necessary - but am not an expert in Hudi
>>> implementation.
>>>
>>> But am pretty sure it is not sufficient for the use cases I have in
>>> mind. The gist is using Hudi as a file based data lake + ML feature store
>>> that enables incremental analyses done with a combination of Flink, Beam,
>>> Spark, Tensorlflow (see Petastorm from UberEng for an idea.)
>>>
>>> Let us call this HudiLink from now on (think of it as a mediator, not
>>> another Hudi).
>>>
>>> The intuition behind looking at more then Flink is that both Beam and
>>> Flink have good design abstractions we might reuse and extend.
>>>
>>> Like I said before, do not believe in point to point integrations.
>>>
>>> Alternatively / in parallel,If you care to share your use cases it would
>>> be very useful. Working with explicit use cases helps others to relate and
>>> help.
>>>
>>> Also, if some of you know there believe in (see) value of refactoring
>>> Hudi implementation for a hard integration with Flink (but have no time to
>>> argue for it) ofc you please go ahead.
>>>
>>> That may be a valid bottom up approach but I cannot relate to it myself
>>> (due to lack of use cases).
>>>
>>> Working on a material on HudiLink - if any are interested I might
>>> publish when more mature.
>>>
>>> Hint: this was part of the inspiration
>>> https://eng.uber.com/michelangelo/
>>>
>>> One well thought use case will get you "in". :-) Kidding, ofc.
>>>
>>> Cheers
>>>
>>> Nick
>>>
>>>
>>> On August 3, 2019 at 10:55 PM vino yang  wrote:
>>>
>>>
>>> +1 for both Beam and Flink
>>>
>>> First step here is to probably draw out current hierrarchy and figure out
>>> what the abstraction points are..
>>> In my opinion, the runtime (Spark, Flink) should be handled at the
>>> hoodie-client level and just used by hoodie-utilities seamlessly..
>>>
>>>
>>> +1 for Vinoth's opinion, it should be the first step.
>>>
>>> No matter which computing framework we hope Hudi will integrate with,
>>> we need to decouple the Hudi client from Spark.
>>>
>>> We may need a pure client module named 

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-04 Thread vino yang
Hi Nick,

Thank you for your more detailed thoughts, and I fully agree with your
ideas about HudiLink, which should also be part of the long-term
planning of the Hudi ecosystem.


*But I found that our angles of thinking and starting points are not the
same. I pay more attention to the rationality of the existing
architecture and to whether the dependency on the computing engine is
pluggable. Don't get me wrong, I know very well that although we have
different perspectives, both views have value for Hudi.*
Let me give more details on the points I made earlier.

Currently, multiple submodules of the Hudi project are tightly coupled to
Spark's design and dependencies. You can see that many of the class files
contain statements such as "import org.apache.spark.xxx".

I first put forward a discussion: "Integrate Hudi with Apache Flink", and
then came up with a discussion: "Decouple Hudi and Spark".

I think the word "Integrate" I used for the first discussion may not be
accurate enough. My intention is to make the computing engine used by Hudi
pluggable. To Hudi, Spark is just a library; it is not the core of Hudi and
it should not be strongly coupled with Hudi. The features currently
provided by Spark are also available from Flink. But in order to achieve
this, we need to decouple Hudi from Spark at the code level.

This makes sense both in terms of architectural soundness and the
community ecosystem.

Best,
Vino


Semantic Beeng  wrote on Sunday, August 4, 2019 at 2:21 PM:

> "+1 for both Beam and Flink" - what I propose implies this indeed.
>
> But/and am working from the desired functionality and a proposed design.
>
> (as opposed to starting with refactoring Hudi with the goal of close
> integration with Flink)
>
> I feel this is not necessary - but am not an expert in Hudi implementation.
>
> But am pretty sure it is not sufficient for the use cases I have in mind.
> The gist is using Hudi as a file-based data lake + ML feature store that
> enables incremental analyses done with a combination of Flink, Beam, Spark,
> TensorFlow (see Petastorm from Uber Engineering for an idea.)
>
> Let us call this HudiLink from now on (think of it as a mediator, not
> another Hudi).
>
> The intuition behind looking at more than Flink is that both Beam and
> Flink have good design abstractions we might reuse and extend.
>
> Like I said before, do not believe in point to point integrations.
>
> Alternatively / in parallel, if you care to share your use cases it would
> be very useful. Working with explicit use cases helps others to relate and
> help.
>
> Also, if some of you believe in (see) the value of refactoring the Hudi
> implementation for a hard integration with Flink (but have no time to argue
> for it), ofc please go ahead.
>
> That may be a valid bottom up approach but I cannot relate to it myself
> (due to lack of use cases).
>
> Working on material about HudiLink - if any of you are interested I might
> publish it when it is more mature.
>
> Hint: this was part of the inspiration https://eng.uber.com/michelangelo/
>
> One well-thought-out use case will get you "in". :-) Kidding, ofc.
>
> Cheers
>
> Nick
>
>
> On August 3, 2019 at 10:55 PM vino yang  wrote:
>
>
> +1 for both Beam and Flink
>
> First step here is to probably draw out the current hierarchy and figure out
> what the abstraction points are..
> In my opinion, the runtime (Spark, Flink) should be handled at the
> hoodie-client level and just used by hoodie-utilities seamlessly..
>
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter which computing framework we hope Hudi will integrate with,
> we need to decouple the Hudi client from Spark.
>
> We may need a pure client module named, for example,
> hoodie-client-core (common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
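
To make that split concrete, here is a rough sketch of the kind of abstraction
hoodie-client-core could expose. Everything below is hypothetical (the names
HoodieData, HoodieEngineContext and the engine bindings do not exist today);
it is only meant to show the shape of the decoupling, not a real design.

    // Purely illustrative; none of these types exist in the current codebase.
    import java.util.List;
    import java.util.function.Function;

    /** Engine-agnostic handle over a distributed collection of records. */
    interface HoodieData<T> {
        <R> HoodieData<R> map(Function<T, R> fn);
        long count();
    }

    /** Lives in hoodie-client-core; knows nothing about Spark or Flink. */
    interface HoodieEngineContext {
        <T> HoodieData<T> parallelize(List<T> items);
    }

    // hoodie-client-spark: a SparkEngineContext backed by JavaRDDs.
    // hoodie-client-flink: a FlinkEngineContext backed by DataStreams/DataSets.
    // hoodie-client-beam:  a BeamEngineContext backed by PCollections.

The write path in core would then be written only against these interfaces,
and hoodie-utilities would just pick whichever engine module is on the classpath.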
>
> Suneel Marthi  wrote on Sunday, August 4, 2019 at 10:45 AM:
>
> +1 for Beam -- agree with Semantic Beeng's analysis.
>
> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala 
> wrote:
>
> So the way to go about this is to file a HIP, chalk out all the classes,
> and start moving towards a pure client.
>
> Secondly, do we want to try Beam?
>
> I think there is too much going on here and I'm not able to follow. If we
> want to try out Beam all along, then I don't think it makes sense to do
> anything on Flink.
>
> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng 
> wrote:
>
> >> +1 My money is on this approach.
> >>
> >> The existing abstractions from Beam seem enough for the use cases as I
> >> imagine them.
> >>
> >> Flink also has "dynamic table", "table source" and "table sink" which
> >> seem very useful abstractions where Hudi might fit nicely.
> >>
> >>
> >>
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> >>
> >>
> >> Attached a screen shot.
> >>
> >> This seems to fit with the original premise of Hudi as well.
> >>
>>> Am exploring this avenue with a use case that involves "temporal joins on
> >> streams" which I need for 

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread Suneel Marthi
+1 for Beam -- agree with Semantic Beeng's analysis.

On Sat, Aug 3, 2019 at 10:30 PM taher koitawala  wrote:

> So the way to go about this is to file a HIP, chalk out all the classes,
> and start moving towards a pure client.
>
> Secondly, do we want to try Beam?
>
> I think there is too much going on here and I'm not able to follow. If we
> want to try out Beam all along, then I don't think it makes sense to do
> anything on Flink.
>
> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng 
> wrote:
>
>> +1 My money is on this approach.
>>
>> The existing abstractions from Beam seem enough for the use cases as I
>> imagine them.
>>
>> Flink also has "dynamic table", "table source" and "table sink" which
>> seem very useful abstractions where Hudi might fit nicely.
>>
>>
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>>
>>
>> Attached a screen shot.
>>
>> This seems to fit with the original premise of Hudi as well.
>>
>> Am exploring this avenue with a use case that involves "temporal joins on
>> streams" which I need for feature extraction.
>>
>> Anyone who is interested in this or has concrete enough needs and use cases,
>> please let me know.
>>
>> Best to go from an agreed upon set of 2-3 use cases.
>>
>> Cheers
>>
>> Nick
>>
>>
>> > Also, we do have some Beam experts on the mailing list.. Can you please
>> weigh in on the viability of using Beam as the intermediate abstraction here
>> between Spark/Flink?
>> Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
>> reduceByKey, countByKey and also does custom partitioning a lot.>
>>
>> >
>>
>


Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread taher koitawala
So the way to go about this is to file a HIP, chalk out all the classes,
and start moving towards a pure client.

Secondly, do we want to try Beam?

I think there is too much going on here and I'm not able to follow. If we
want to try out Beam all along, then I don't think it makes sense to do
anything on Flink.

On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng  wrote:

> +1 My money is on this approach.
>
> The existing abstractions from Beam seem enough for the use cases as I
> imagine them.
>
> Flink also has "dynamic table", "table source" and "table sink" which seem
> very useful abstractions where Hudi might fit nicely.
>
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
>
>
> Attached a screen shot.
>
> This seems to fit with the original premise of Hudi as well.
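
To show where such a sink could plug in, here is a minimal sketch against
Flink's stable DataStream sink API (RichSinkFunction) rather than the Table
layer; the class name HoodieRowSink, the buffering, and the write-client call
are all assumptions for illustration, not existing Hudi code.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.flink.types.Row;

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: buffer rows per task and hand them to an
    // (assumed) engine-agnostic Hudi write client in small batches.
    public class HoodieRowSink extends RichSinkFunction<Row> {
        private final int batchSize;
        private final List<Row> buffer = new ArrayList<>();

        public HoodieRowSink(int batchSize) {
            this.batchSize = batchSize;
        }

        @Override
        public void open(Configuration parameters) {
            // initialize the (hypothetical) write client here
        }

        @Override
        public void invoke(Row value, Context context) {
            buffer.add(value);
            if (buffer.size() >= batchSize) {
                // hypothetical: writeClient.upsert(buffer, instantTime);
                buffer.clear();
            }
        }

        @Override
        public void close() {
            // hypothetical: flush remaining rows and commit the instant
        }
    }

A TableSink wrapper around something like this is what would make a Hudi
table appear as the sink end of a Flink dynamic table.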
>
> Am exploring this avenue with a use case that involves "temporal joins on
> streams" which I need for feature extraction.
>
> Anyone who is interested in this or has concrete enough needs and use cases,
> please let me know.
>
> Best to go from an agreed upon set of 2-3 use cases.
>
> Cheers
>
> Nick
>
>
> > Also, we do have some Beam experts on the mailing list.. Can you please
> weigh in on the viability of using Beam as the intermediate abstraction here
> between Spark/Flink?
> Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> reduceByKey, countByKey and also does custom partitioning a lot.>
>
> >
>


Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread Vinoth Chandar
>>More for my own edification, how does the recently introduced
timeline service play into the delta writer components?

TimelineService runs in the Spark driver (DeltaStreamer is a Hudi Spark
app) and answers metadata/timeline API calls from the executors.. it is not
aware of Spark vs Flink or any runtime stuff.
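
For anyone less familiar with that pattern, the idea is roughly the following,
sketched with nothing but the JDK's built-in HTTP server; the endpoint, port
and payload below are made up for illustration and are not the actual
TimelineService API.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    // Driver/JobManager side: serve timeline metadata over HTTP so that any
    // executor or task, Spark and Flink alike, can query it the same way.
    public class TimelineServiceSketch {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(26754), 0);
            server.createContext("/timeline", exchange -> {
                byte[] body = "[\"commit-001\",\"commit-002\"]".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
            // Executor/task side: a plain HTTP GET against
            // http://<driver-host>:26754/timeline, so nothing on the calling
            // side needs to know about Spark or Flink.
        }
    }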

On Sat, Aug 3, 2019 at 12:50 PM Vinoth Chandar <
mail.vinoth.chan...@gmail.com> wrote:

> Decoupling Spark and Hudi is the first step to bringing in a Flink runtime,
> and it's also the hardest part.
>
> On the decoupling itself, the IOHandle classes are (almost) unaware of
> Spark itself, whereas the Write/ReadClient and the Table classes are very
> much aware..
> First step here is to probably draw out the current hierarchy and figure out
> what the abstraction points are..
> In my opinion, the runtime (Spark, Flink) should be handled at the
> hoodie-client level and just used by hoodie-utilities seamlessly..
>
> My 2c for folks working on this is to maybe pick up a few bugs/issues
> across these areas to get more familiarity with the code and then draw up the
> proposals..
> (not a requirement, but will build more understanding of all
> devils-in-the-details)
>
> >>Not sure if this requires a HIP to drive.
> I think this definitely needs a HIP. It's a large enough change :)
>
> Also, we do have some Beam experts on the mailing list.. Can you please
> weigh in on the viability of using Beam as the intermediate abstraction here
> between Spark/Flink?
> Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> reduceByKey, countByKey and also does custom partitioning a lot.
>
>
>
>
> On Fri, Aug 2, 2019 at 9:46 AM Aaron Langford 
> wrote:
>
>> More for my own edification, how does the recently introduced timeline
>> service play into the delta writer components?
>>
>> On Fri, Aug 2, 2019 at 7:53 AM vino yang  wrote:
>>
>> > Hi Suneel,
>> >
>> > Thank you for your suggestion, let me clarify.
>> >
>> >
>> > *The context of this email is that we are evaluating how to implement a
>> > Stream Delta writer based on Flink.*
>> > As for the discussion between me, Taher and Vinay, those were just some
>> > trivial details in preparing the document, and that discussion also took
>> > place over mail.
>> >
>> > Before we have a first draft, discussing the details on the mailing
>> > list may confuse others and easily drift off topic. Our initial plan
>> > was to facilitate community discussion and review once we had a draft of
>> > the documentation available to the community.
>> >
>> > Best,
>> > Vino
>> >
>> > Suneel Marthi  wrote on Friday, August 2, 2019 at 10:37 PM:
>> >
>> > > Please keep all discussions to Mailing lists here - no offline
>> > discussions
>> > > please.
>> > >
>> > > On Fri, Aug 2, 2019 at 10:22 AM vino yang 
>> wrote:
>> > >
>> > > > Hi guys,
>> > > >
>> > > > Currently, I, Taher and Vinay are working on issue HUDI-184.[1]
>> > > >
>> > > > As a first step, we are discussing the design doc.
>> > > >
>> > > > After diving into the code, we listed some relevant classes related
>> > > > to the Spark delta writer.
>> > > >
>> > > >- module: hoodie-utilities
>> > > >
>> > > > com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
>> > > > com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
>> > > > com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
>> > > > com.uber.hoodie.utilities.schema.SchemaProvider
>> > > > com.uber.hoodie.utilities.transform.Transformer
>> > > >
>> > > >- module: hoodie-client
>> > > >
>> > > > com.uber.hoodie.HoodieWriteClient (to commit compaction)
>> > > >
>> > > >
>> > > > The fact is *hoodie-utilities* depends on *hoodie-client*; however,
>> > > > *hoodie-client* is also not a pure Hudi component, since it also
>> > > > depends on the Spark lib.
>> > > >
>> > > > So I propose Hudi should provide a pure hoodie-client decoupled from
>> > > > Spark. Then the Flink and Spark modules should depend on it.
>> > > >
>> > > > Moreover, based on the old discussion[2], we all agree that Spark is
>> > > > not the only choice for Hudi; it could also be Flink/Beam.
>> > > >
>> > > > IMO, we should decouple Hudi from Spark across the whole project,
>> > > > including but not limited to module splitting and renaming.
>> > > >
>> > > > Not sure if this requires a HIP to drive.
>> > > >
>> > > > We should first listen to the opinions of the community. Any ideas
>> and
>> > > > suggestions are welcome and appreciated.
>> > > >
>> > > > Best,
>> > > > Vino
>> > > >
>> > > > [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
>> > > > [2]:
>> > > >
>> > > >
>> > >
>> >
>> https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
>> > > >
>> > >
>> >
>>
>


Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread Vinoth Chandar
Decoupling Spark and Hudi is the first step to bringing in a Flink runtime,
and it's also the hardest part.

On the decoupling itself, the IOHandle classes are (almost) unaware of
Spark itself, whereas the Write/ReadClient and the Table classes are very
much aware..
First step here is to probably draw out the current hierarchy and figure out
what the abstraction points are..
In my opinion, the runtime (Spark, Flink) should be handled at the
hoodie-client level and just used by hoodie-utilities seamlessly..

My 2c for folks working on this is to maybe pick up a few bugs/issues across
these areas to get more familiarity with the code and then draw up the
proposals..
(not a requirement, but will build more understanding of all
devils-in-the-details)

>>Not sure if this requires a HIP to drive.
I think this definitely needs a HIP. It's a large enough change :)

Also, we do have some Beam experts on the mailing list.. Can you please
weigh in on the viability of using Beam as the intermediate abstraction here
between Spark/Flink?
Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
reduceByKey, countByKey and also does custom partitioning a lot.
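
For what it's worth, most of those operations do have fairly direct Beam
counterparts; here is a tiny, illustrative sketch with toy data (not Hudi
code, and a runner such as the direct runner is assumed to be on the
classpath just to make it runnable).

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamCounterparts {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();

            // mapToPair -> MapElements into KV
            PCollection<KV<String, Integer>> pairs =
                p.apply(Create.of("insert", "update", "insert"))
                 .apply(MapElements
                     .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
                     .via((String op) -> KV.of(op, 1)));

            pairs.apply(Sum.integersPerKey());   // ~ reduceByKey
            pairs.apply(Count.perKey());         // ~ countByKey
            pairs.apply(GroupByKey.create());    // ~ groupBy on a key

            p.run().waitUntilFinish();
        }
    }

Custom partitioning is the part that does not map cleanly, since Beam
intentionally hides physical partitioning, so that piece would likely have
to stay behind an engine-specific hook.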




On Fri, Aug 2, 2019 at 9:46 AM Aaron Langford 
wrote:

> More for my own edification, how does the recently introduced timeline
> service play into the delta writer components?
>
> On Fri, Aug 2, 2019 at 7:53 AM vino yang  wrote:
>
> > Hi Suneel,
> >
> > Thank you for your suggestion, let me clarify.
> >
> >
> > *The context of this email is that we are evaluating how to implement a
> > Stream Delta writer based on Flink.*
> > As for the discussion between me, Taher and Vinay, those were just some
> > trivial details in preparing the document, and that discussion also took
> > place over mail.
> >
> > Before we have a first draft, discussing the details on the mailing
> > list may confuse others and easily drift off topic. Our initial plan
> > was to facilitate community discussion and review once we had a draft of
> > the documentation available to the community.
> >
> > Best,
> > Vino
> >
> > Suneel Marthi  wrote on Friday, August 2, 2019 at 10:37 PM:
> >
> > > Please keep all discussions to Mailing lists here - no offline
> > discussions
> > > please.
> > >
> > > On Fri, Aug 2, 2019 at 10:22 AM vino yang 
> wrote:
> > >
> > > > Hi guys,
> > > >
> > > > Currently, I, Taher and Vinay are working on issue HUDI-184.[1]
> > > >
> > > > As a first step, we are discussing the design doc.
> > > >
> > > > After diving into the code, we listed some relevant classes related to
> > > > the Spark delta writer.
> > > >
> > > >- module: hoodie-utilities
> > > >
> > > > com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> > > > com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
> > > > com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
> > > > com.uber.hoodie.utilities.schema.SchemaProvider
> > > > com.uber.hoodie.utilities.transform.Transformer
> > > >
> > > >- module: hoodie-client
> > > >
> > > > com.uber.hoodie.HoodieWriteClient (to commit compaction)
> > > >
> > > >
> > > > The fact is *hoodie-utilities* depends on *hoodie-client*; however,
> > > > *hoodie-client* is also not a pure Hudi component, since it also
> > > > depends on the Spark lib.
> > > >
> > > > So I propose Hudi should provide a pure hoodie-client decoupled from
> > > > Spark. Then the Flink and Spark modules should depend on it.
> > > >
> > > > Moreover, based on the old discussion[2], we all agree that Spark is
> > > > not the only choice for Hudi; it could also be Flink/Beam.
> > > >
> > > > IMO, we should decouple Hudi from Spark across the whole project,
> > > > including but not limited to module splitting and renaming.
> > > >
> > > > Not sure if this requires a HIP to drive.
> > > >
> > > > We should first listen to the opinions of the community. Any ideas
> and
> > > > suggestions are welcome and appreciated.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
> > > > [2]:
> > > >
> > > >
> > >
> >
> https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
> > > >
> > >
> >
>


Re: [DISCUSS] Decouple Hudi and Spark

2019-08-02 Thread vino yang
Hi Suneel,

Thank you for your suggestion, let me clarify.


*The context of this email is that we are evaluating how to implement a
Stream Delta writer based on Flink.*
As for the discussion between me, Taher and Vinay, those were just some
trivial details in preparing the document, and that discussion also took
place over mail.

Before we have a first draft, discussing the details on the mailing
list may confuse others and easily drift off topic. Our initial plan
was to facilitate community discussion and review once we had a draft of
the documentation available to the community.

Best,
Vino

Suneel Marthi  wrote on Friday, August 2, 2019 at 10:37 PM:

> Please keep all discussions to Mailing lists here - no offline discussions
> please.
>
> On Fri, Aug 2, 2019 at 10:22 AM vino yang  wrote:
>
> > Hi guys,
> >
> > Currently, I, Taher and Vinay are working on issue HUDI-184.[1]
> >
> > As a first step, we are discussing the design doc.
> >
> > After diving into the code, we listed some relevant classes related to the
> > Spark delta writer.
> >
> >- module: hoodie-utilities
> >
> > com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> > com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
> > com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
> > com.uber.hoodie.utilities.schema.SchemaProvider
> > com.uber.hoodie.utilities.transform.Transformer
> >
> >- module: hoodie-client
> >
> > com.uber.hoodie.HoodieWriteClient (to commit compaction)
> >
> >
> > The fact is *hoodie-utilities* depends on *hoodie-client*; however,
> > *hoodie-client* is also not a pure Hudi component, since it also depends
> > on the Spark lib.
> >
> > So I propose Hudi should provide a pure hoodie-client decoupled from
> > Spark. Then the Flink and Spark modules should depend on it.
> >
> > Moreover, based on the old discussion[2], we all agree that Spark is not
> > the only choice for Hudi; it could also be Flink/Beam.
> >
> > IMO, we should decouple Hudi from Spark across the whole project,
> > including but not limited to module splitting and renaming.
> >
> > Not sure if this requires a HIP to drive.
> >
> > We should first listen to the opinions of the community. Any ideas and
> > suggestions are welcome and appreciated.
> >
> > Best,
> > Vino
> >
> > [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
> > [2]:
> >
> >
> https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
> >
>


Re: [DISCUSS] Decouple Hudi and Spark

2019-08-02 Thread Suneel Marthi
Please keep all discussions to Mailing lists here - no offline discussions
please.

On Fri, Aug 2, 2019 at 10:22 AM vino yang  wrote:

> Hi guys,
>
> Currently, I, Taher and Vinay are working on issue HUDI-184.[1]
>
> As a first step, we are discussing the design doc.
>
> After diving into the code, we listed some relevant classes related to the
> Spark delta writer.
>
>- module: hoodie-utilities
>
> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
> com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
> com.uber.hoodie.utilities.schema.SchemaProvider
> com.uber.hoodie.utilities.transform.Transformer
>
>- module: hoodie-client
>
> com.uber.hoodie.HoodieWriteClient (to commit compaction)
>
>
> The fact is *hoodie-utilities* depends on *hoodie-client*; however,
> *hoodie-client* is also not a pure Hudi component, since it also depends on
> the Spark lib.
>
> So I propose Hudi should provide a pure hoodie-client decoupled from
> Spark. Then the Flink and Spark modules should depend on it.
>
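
To illustrate what "pure" could mean at the API level: today the client
surface is Spark-typed (the upsert/insert calls take and return JavaRDDs, as
far as I recall), whereas a decoupled contract could look roughly like the
hypothetical sketch below; the names and signatures are made up for
illustration, not an existing or proposed final API.

    // Hypothetical, engine-neutral contract for a future hoodie-client-core.
    // I and O stand in for whatever distributed-collection handle the engine
    // module (Spark, Flink, Beam) supplies; nothing here imports Spark.
    public interface PureHoodieWriteClient<I, O> {
        O upsert(I records, String instantTime);
        O insert(I records, String instantTime);
        boolean commit(String instantTime, O writeStatuses);
    }

    // hoodie-client-spark could then bind I = JavaRDD<HoodieRecord>,
    // O = JavaRDD<WriteStatus>, while hoodie-client-flink binds its own types.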
> Moreover, based on the old discussion[2], we all agree that Spark is not
> the only choice for Hudi; it could also be Flink/Beam.
>
> IMO, we should decouple Hudi from Spark across the whole project,
> including but not limited to module splitting and renaming.
>
> Not sure if this requires a HIP to drive.
>
> We should first listen to the opinions of the community. Any ideas and
> suggestions are welcome and appreciated.
>
> Best,
> Vino
>
> [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
> [2]:
>
> https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
>