Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi Vino,
  This is not a design for Hudi on Flink. It was simply a mock-up of the
tagLocations() Spark cache moved to Flink state, as Vinoth wanted to see.

As for Flink batch and streaming, I am well aware of Flink's batch/stream
unification efforts. However, that work is still in progress, so for now
let's stick to Spark for batch on Hudi and use Flink only for streaming.



Regards,
Taher Koitawala

On Wed, Sep 25, 2019, 7:52 AM vino yang  wrote:

> Hi Taher,
>
> As I mentioned in the previous mail. Things may not be too easy by just
> using Flink state API.
>
> Copied here "Hudi can connect with many different Source/Sinks. Some
> file-based reads are not appropriate for Flink Streaming."
>
> Although, unify Batch and Streaming is Flink's goal. But, it is difficult
> to ignore Flink Batch API to match some features provide by Hudi now.
>
> The example you provided is in application layer about usage. So my
> suggestion is be patient, it needs time to give an detailed design.
>
> Best,
> Vino
>
>
>
> On 09/24/2019 22:38, Taher Koitawala  wrote:
> Hi All,
>  Sample code to see how records tagging will be handled in
> Flink is posted on [1]. The main class to run the same is MockHudi.java
> with a sample path for checkpointing.
>
> As of now this is just a sample to know we should ke caching in Flink
> states with bare minimum configs.
>
>
> As per my experience I have cached around 10s of TBs in Flink rocksDB
> state
> with the right configs. So I'm sure it should work here as well.
>
> 1:
>
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
>
> Regards,
> Taher Koitawala
>
>
> On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar  wrote:
>
> > It wont be much different than the HBaseIndex we have today. Would like
> to
> > have always have an option like BloomIndex that does not need any
> external
> > dependencies.
> > The moment you bring an external data store in, someone becomes a DBA.
> :)
> >
> > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng 
> > wrote:
> >
> > > @vc can you see how ApacheCrail could be used to implement this at
> scale
> > > but also in a way that abstracts over both Spark and Flink?
> > >
> > > "Crail Store implements a hierarchical namespace across a cluster of
> RDMA
> > > interconnected storage resources such as DRAM or flash"
> > >
> > > https://crail.incubator.apache.org/overview/
> > >
> > > + 2 cents
> > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> > >
> > > Cheers
> > >
> > > Nick
> > >
> > > On September 22, 2019 at 9:28 AM Vinoth Chandar 
> > wrote:
> > >
> > >
> > > It could be much larger. :) imagine billions of keys each 32 bytes,
> > mapped
> > > to another 32 byte
> > >
> > > The advantage of the current bloom index is that its effectively
> stored
> > > with data itself and this reduces complexity in terms of keeping index
> > and
> > > data consistent etc
> > >
> > > One orthogonal idea from long time ago that moves indexing out of data
> > > storage and is generalizable
> > >
> > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> > >
> > > If someone here knows flink well and can implement some standalone
> flink
> > > code to mimic tagLocation() functionality and share with the group,
> that
> > > would be great. Lets worry about performance once we have a flink DAG.
> I
> > > think this is a critical and most tricky piece in supporting flink.
> > >
> > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil 
> > > wrote:
> > >
> > > Hi Taher,
> > >
> > > I agree with this , if the state is becoming too large we should have
> an
> > > option of storing it in external state like File System or RocksDb.
> > >
> > > @Vinoth Chandar  can the state of HoodieBloomIndex
> go
> > > beyond 10-15 GB
> > >
> > > Regards,
> > > Vinay Patil
> > >
> > > >
> > >
> > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala 
> > > wrote:
> > >
> > > >> Hey Guys, Any thoughts on the above idea? To handle
> HoodieBloomIndex
> > > with
> > > >> HeapState, RocksDBState and FsState but on Spark.
> > > >>
> > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala 
>
> > > >> wrote:
> > > >>
> > > >> > Hi Vinoth,
> > > >> > Having seen the doc and code. I understand the
> > > >> > HoodieBloomIndex mainly caches key and partition path. Can we
> > address
> > > >> how
> > > >> > Flink does it? Like, have HeapState where the user chooses to
> cache
> > > the
> > > >> > Index on heap, RockDBState where indexes are written to RocksDB
> and
> > > >> finally
> > > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And
> on
> > > top,
> > > >> we
> > > >> > can do an index Time To Live.
> > > >> >
> > > >> > Regards,
> > > >> > Taher Koitawala
> > > >> >
> > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <
> vin...@apache.org>
> > > >> wrote:
> > > >> >
> > > >> >> I still feel the key thing here is reimplementing
> HoodieBloomIndex
> > > >> without
> > >> >> needing spark caching.

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread vino yang
Hi Taher,

As I mentioned in the previous mail, things may not be too easy by just
using the Flink state API.

Copied here: "Hudi can connect with many different Source/Sinks. Some
file-based reads are not appropriate for Flink Streaming."

Although unifying batch and streaming is Flink's goal, it is difficult to
ignore the Flink Batch API while matching some features Hudi provides today.
The example you provided concerns usage at the application layer. So my
suggestion is to be patient; it takes time to produce a detailed design.

Best,
Vino

On 09/24/2019 22:38, Taher Koitawala wrote:
> Hi All,
>   Sample code to see how record tagging will be handled in
> Flink is posted at [1]. The main class to run it is MockHudi.java,
> with a sample path for checkpointing.
>
> As of now this is just a sample to show that we could be caching in Flink
> states with bare minimum configs.
>
> As per my experience I have cached around 10s of TBs in Flink RocksDB state
> with the right configs. So I'm sure it should work here as well.
>
> 1:
> https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
>
> Regards,
> Taher Koitawala

Re: FAQ page

2019-09-24 Thread vino yang
I have read the FAQ page. It looks good to me; it covers many valuable,
high-frequency questions.

I have a suggestion: besides Hudi, there are two other projects aimed at
data lakes, Iceberg and Delta Lake. It would be good if we could provide a
comparison between Hudi and them. It is a high-frequency question that I
have often been asked. What do you think?

On 09/25/2019 01:16, Bhavani Sudha Saktheeswaran wrote:
> This is really cool. Thanks for putting this page together Vinoth !

Re: FAQ page

2019-09-24 Thread Bhavani Sudha Saktheeswaran
This is really cool. Thanks for putting this page together Vinoth !


On Tue, Sep 24, 2019 at 7:39 AM Nishith  wrote:

> The FAQ looks awesome Vinoth! Answers most of the questions that folks are
> confused about.
> Hoping folks can contribute more as we uncover more frequently asked
> questions.
>
> - Nishith
>
> Sent from my iPhone
>
> > On Sep 23, 2019, at 5:51 PM, vino yang  wrote:
> >
> > Thanks for your great work, Vinoth and Nishith. Will have a look soon.
> On 09/24/2019 07:51, vbal...@apache.org wrote: +1 Awesome job Vinoth and
> Nishith for compiling the initial version of FAQ. Agree on the idea of
> replying using FAQ. Balaji.V On Monday, September 23, 2019, 04:41:03 PM
> PDT, Vinoth Chandar  wrote:   First version of the
> page is now fully completed.
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185
> Please try to use the FAQs when answering questions on ML and GH. It will
> only get better if we manage this effectively and keep improving it. On
> Sun, Sep 15, 2019 at 9:41 PM Vinoth Chandar  wrote: >
> Thanks! Will work this week to fill out most answers! > Your help reviewing
> would also be much appreciated. > Will keep this thread posted.. > > On
> Tue, Sep 10, 2019 at 6:10 PM vino yang  wrote: >
> >> Hi Vinoth, >> >> Great job! Thanks for your efforts! >> I think this
> page is good for users and developers to let them know Hudi >> well. >> >>
> Best, >> Vino >> >> >> >> Vinoth Chandar 
> 于2019年9月11日周三 上午2:27写道: >> >> > Hi all, >> > >> > I wrote a list of
> questions based on mailing list conversations and >> issues. >> > >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185
> >> > >> > While I am still working through answers, I thought this
> can be a good >> > community driven process. >> > >> > >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-ContributingtoFAQ
> >> > > >> > >> > Please help by contributing answers or new questions if
> you can! >> > >> > thanks >> > vinoth >> > >> >
>


Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi All,
  Sample code to see how record tagging will be handled in
Flink is posted at [1]. The main class to run it is MockHudi.java,
with a sample path for checkpointing.

As of now this is just a sample to show that we could be caching in Flink
states with bare minimum configs.


As per my experience I have cached around 10s of TBs in Flink RocksDB state
with the right configs, so I'm sure it should work here as well.

1:
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi

Regards,
Taher Koitawala
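As a rough illustration of the idea being mocked up here, the sketch below models tagLocation()-style lookups against keyed state. It is not Hudi or Flink code: a plain dict stands in for a per-key Flink MapState inside a KeyedProcessFunction, and all class and method names are hypothetical.

```python
# Sketch (not Hudi/Flink code): models how tagLocation() could be backed by
# Flink keyed state instead of Spark RDD caching. A plain dict stands in for
# MapState living in the configured backend (heap or RocksDB).

class StatefulTagger:
    """Tags incoming records with the file group that holds their key."""

    def __init__(self):
        # record_key -> (partition_path, file_id)
        self.state = {}

    def tag(self, record_key, partition_path):
        loc = self.state.get(record_key)
        if loc is not None:
            # Known key: route as an update to its existing file group.
            return ("update", loc[1])
        # Unknown key: an insert; its location is assigned after the write.
        return ("insert", None)

    def commit_location(self, record_key, partition_path, file_id):
        # Called once the write succeeds, mirroring index updates on commit.
        self.state[record_key] = (partition_path, file_id)


tagger = StatefulTagger()
assert tagger.tag("uuid-1", "2019/09/24") == ("insert", None)
tagger.commit_location("uuid-1", "2019/09/24", "file-001")
assert tagger.tag("uuid-1", "2019/09/24") == ("update", "file-001")
```

In a real job the dict would be replaced by state scoped to the record key, so lookups stay local to each parallel subtask and survive checkpoints.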


On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar  wrote:

> It wont be much different than the HBaseIndex we have today. Would like to
> have always have an option like BloomIndex that does not need any external
> dependencies.
> The moment you bring an external data store in, someone becomes a DBA. :)
>
> On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng 
> wrote:
>
> > @vc can you see how ApacheCrail could be used to implement this at scale
> > but also in a way that abstracts over both Spark and Flink?
> >
> > "Crail Store implements a hierarchical namespace across a cluster of RDMA
> > interconnected storage resources such as DRAM or flash"
> >
> > https://crail.incubator.apache.org/overview/
> >
> > + 2 cents
> > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
> >
> > Cheers
> >
> > Nick
> >
> > On September 22, 2019 at 9:28 AM Vinoth Chandar 
> wrote:
> >
> >
> > It could be much larger. :) imagine billions of keys each 32 bytes,
> mapped
> > to another 32 byte
> >
> > The advantage of the current bloom index is that its effectively stored
> > with data itself and this reduces complexity in terms of keeping index
> and
> > data consistent etc
> >
> > One orthogonal idea from long time ago that moves indexing out of data
> > storage and is generalizable
> >
> > https://github.com/apache/incubator-hudi/wiki/HashMap-Index
> >
> > If someone here knows flink well and can implement some standalone flink
> > code to mimic tagLocation() functionality and share with the group, that
> > would be great. Lets worry about performance once we have a flink DAG. I
> > think this is a critical and most tricky piece in supporting flink.
> >
> > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil 
> > wrote:
> >
> > Hi Taher,
> >
> > I agree with this , if the state is becoming too large we should have an
> > option of storing it in external state like File System or RocksDb.
> >
> > @Vinoth Chandar  can the state of HoodieBloomIndex go
> > beyond 10-15 GB
> >
> > Regards,
> > Vinay Patil
> >
> > >
> >
> > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala 
> > wrote:
> >
> > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex
> > with
> > >> HeapState, RocksDBState and FsState but on Spark.
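A quick back-of-the-envelope check of the numbers quoted above (billions of ~32-byte keys mapped to ~32-byte values, against a 10-15 GB heap budget), sketched purely for illustration:

```python
# Back-of-envelope check of the state-size numbers quoted in this thread:
# billions of record keys, each ~32 bytes, mapped to ~32-byte locations.
keys = 1_000_000_000
entry_bytes = 32 + 32          # key + mapped value, ignoring overhead
total_gb = keys * entry_bytes / 1024**3
assert 59 < total_gb < 60      # roughly 60 GB raw, before backend overhead
```

So a billion-key index is already well past a 10-15 GB in-heap budget, though comfortably within the multi-TB RocksDB-state scale reported earlier in the thread.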
> > >>
> > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala 
> > >> wrote:
> > >>
> > >> > Hi Vinoth,
> > >> > Having seen the doc and code. I understand the
> > >> > HoodieBloomIndex mainly caches key and partition path. Can we
> address
> > >> how
> > >> > Flink does it? Like, have HeapState where the user chooses to cache
> > the
> > >> > Index on heap, RockDBState where indexes are written to RocksDB and
> > >> finally
> > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And on
> > top,
> > >> we
> > >> > can do an index Time To Live.
> > >> >
> > >> > Regards,
> > >> > Taher Koitawala
> > >> >
> > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar 
> > >> wrote:
> > >> >
> > >> >> I still feel the key thing here is reimplementing HoodieBloomIndex
> > >> without
> > >> >> needing spark caching.
> > >> >>
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global
> > )
> > >> >> documents the spark DAG in detail.
> > >> >>
> > >> >> If everyone feels, it's best for me to scope the work out, then
> happy
> > >> to
> > >> >> do
> > >> >> it!
> > >> >>
> > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <
> taher...@gmail.com
> > >
> > >> >> wrote:
> > >> >>
> > >> >> > Guys I think we are slowing down on this again. We need to start
> > >> >> planning
> > >> >> > small small tasks towards this VC please can you help fast track
> > >> this?
> > >> >> >
> > >> >> > Regards,
> > >> >> > Taher Koitawala
> > >> >> >
> > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar  >
> > >> >> wrote:
> > >> >> >
> > >> >> > > Look forward to the analysis. A key class to read would be
> > >> >> > > HoodieBloomIndex, which uses a lot of spark caching and
> shuffles.
> > >> >> > >
> > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang <
> yanghua1...@gmail.com
> > >
> > >> >> wrote:
> > >> >> > >
> > >> >> > > > >> Currently Spark Streaming micro batching fits well with
> > Hudi,
> > >> >> since
> > >> >> > it
> > >> >> > > > amortizes the cost of indexing, workload profiling etc. 1
> spark
> > >> >> micro
> > >> >> > > batch
> > >> >> > > > = 1 hudi commit
> > >> >> > > > With the 

Re: [PROPOSAL] Hudi Web UI

2019-09-24 Thread Vinoth Chandar
Thanks for doing it! Will review sometime this week.

On Mon, Sep 23, 2019 at 5:43 PM vino yang  wrote:

> Thanks Taher, great job! Will have another look soon. Best, Vino On
> 09/24/2019 02:15, Taher Koitawala wrote: Hi All, Hip has been
> migrated to confluence. Please take a look.
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233
> Regards, Taher Koitawala On Mon, Sep 23, 2019 at 10:17 PM Taher Koitawala <
> taher...@gmail.com> wrote: > Yup got it. Thanks Vinoth > > On Mon, Sep
> 23, 2019, 10:02 PM Vinoth Chandar  wrote: > >> Now? >>
> >> On Mon, Sep 23, 2019 at 9:05 AM Taher Koitawala 
> >> wrote: >> >> > Hey Vinoth still the same. No perms >> > >> > On Mon, Sep
> 23, 2019 at 9:24 PM Vinoth Chandar  >> wrote: >> > >>
> > > thats correct. you should have perms now. >> > > >> > > On Sun, Sep 22,
> 2019 at 11:06 PM Taher Koitawala  >> > > wrote: >> >
> > >> > > > Hi Vinoth, >> > > >I do not have access on
> confluence to create a page >> in >> > the >> > > > HIP section or even
> copy the HIP template. Please can you give >> access? >> > My >> > > > id
> is taherk77. >> > > > >> > > > Regards, >> > > > Taher Koitawala >> > > >
> >> > > > On Sun, Sep 22, 2019 at 6:34 PM Vinoth Chandar 
> >> > > wrote: >> > > > >> > > > > Taher, can we please move the HIP to the
> cWiki space as documented >> > here >> > > > > >> > > > > >> > > > >> > >
> >> > >>
> https://cwiki.apache.org/confluence/display/HUDI/Hudi+Improvement+Plan+Details+and+Process
> >> > > > > >> > > > > >> > > > > Would love to take a pass at it. This will
> definitely improve >> > > usability.. >> > > > > >> > > > > @leesf I think
> we can use the standalone timeline server bundle >> you >> > > > worked >>
> > > > > on, and have a central timeline server for all Hudi jobs within >>
> your >> > > > org.. >> > > > > Then, if you host an UI endpoint on that
> server, we can visualize >> > much >> > > of >> > > > > what Tahen put up
> on the doc. thoughts? >> > > > > >> > > > > On Sun, Sep 22, 2019 at 4:07 AM
> Taher Koitawala < >> taher...@gmail.com> >> > > > > wrote: >> > > > > >>
> > > > > > Hi Leesf, >> > > > > >Thank you for your interest.
> HIP already has been >> > > > implemented >> > > > > in >> > > > > > terms
> of design and components we need to see. The link is given >> > > below. >>
> > > > > > Please free to chime in, to add and implement. >> > > > > > >> >
> > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >>
> https://docs.google.com/document/d/1oEjukuaK2ltqiD0sjVs5IUzvDzilWF0viwMASXTdEd4/edit?usp=sharing
> >> > > > > > >> > > > > > Regards, >> > > > > > Taher Koitawala >> > > > >
> > >> > > > > > On Sun, Sep 22, 2019, 3:26 PM leesf 
> >> wrote: >> > > > > > >> > > > > > > + 1 to this valuable HIP. >> > > > >
> > > And I am also very interested, perhaps together to implement >> this >>
> > > > HIP. >> > > > > > > >> > > > > > > Best, >> > > > > > > Leesf >> > >
> > > > > >> > > > > > > Taher Koitawala  于2019年9月22日周日
> 下午2:09写道: >> > > > > > > >> > > > > > > > Guys it not only includes tables
> views and admin kind of >> > > > > > > views(Compactions) >> > > > > > > >
> but it also includes 'Metadata lineage' which can help users >> > know >> >
> > > how >> > > > > > > this >> > > > > > > > Hudi dataset merged and also
> another strong feature is >> creating >> > > > > > > > DeltaStreamer jobs
> through the webui and having >> DeltaStreamer >> > > > > > > >
> templates(Makes sharing jobs easy). I think those are really >> > > really
> >> > > > > > > strong >> > > > > > > > features. >> > > > > > > > >> > > >
> > > > > >> > > > > > > > Regards, >> > > > > > > > Taher Koitawala >> > > >
> > > > > >> > > > > > > > On Sun, Sep 22, 2019, 9:52 AM Bhavani Sudha
> Saktheeswaran >> > > > > > > >  wrote: >> > >
> > > > > > >> > > > > > > > > +1 for adding web ui. The web ui viz for table
> configs >> would >> > be >> > > > > > pretty >> > > > > > > > > useful for
> easy debugging. >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >>
> > > > > > > > > >> > > > > > > > > On Sat, Sep 21, 2019 at 7:35 PM Vinoth
> Chandar < >> > > > vin...@apache.org> >> > > > > > > > wrote: >> > > > >
> > > > > >> > > > > > > > > > +1 will take a look at the doc for specifics
> in a few >> days. >> > > > > > > > > > >> > > > > > > > > > On Sat, Sep 21,
> 2019 at 7:18 PM vino yang < >> > > > yanghua1...@gmail.com >> > > > > >
> >> > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > +1 to
> introduce Hudi web UI. Great suggestion! On >> > > 09/21/2019 >> > > > > >
> 12:24, >> > > > > > > > > Minh >> > > > > > > > > > > Pham wrote: +1. I
> think an admin UI will help with >> > > > reusability >> > > > > > > alot.
> >> > > > > > > > On >> > > > > > > > > > > Fri, Sep 20, 2019 at 8:32 PM
> Vinay Patil < >> > > > > > vinay18.pa...@gmail.com> >> > > > > > > > > >
> wrote: >> > > > > > > > > > > > Hi Taher, > > I really liked this idea,
> these >> details >> > > will >> > > > be >> > > > 

Re: [DISCUSS] Hudi with Nifi

2019-09-24 Thread Vinoth Chandar
Sg, lets capture these discussions in the JIRA (link to the discussion
thread should suffice) and we can revisit one by one..

On Mon, Sep 23, 2019 at 8:31 PM Taher Koitawala  wrote:

> Sure Vinoth, I think we need to try this out and check how it fits together
> and how deployable it is.
>
> On Sun, Sep 22, 2019, 7:01 PM Vinoth Chandar  wrote:
>
> > See a lot of Spark Streaming receiver based approach code there, which
> > makes me a bit worried about scalability.
> >
> > Nonetheless, API-wise can't we just do dstream.rdd.foreach and issue these
> > writes using the WriteClient API?
> >
> > On Sat, Sep 21, 2019 at 4:16 AM Taher Koitawala 
> > wrote:
> >
> > > Hi Vinoth,
> > > Nifi has the capability to pass data to a custom spark
> > job.
> > > However that is done through a StreamingContext, not sure if we can
> build
> > > something on this. I'm trying to wrap my head around how to fit the
> > > StreamingContext in our existing code.
> > >
> > > Here is an example:
> > > https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > > On Wed, Sep 18, 2019, 8:27 PM Vinoth Chandar 
> wrote:
> > >
> > > > Not too familiar wth Nifi myself. Would this still target an use-case
> > > like
> > > > what pratyaksh mentioned?
> > > > For delta streamer specifically, we are moving more and more towards
> > > > continuous mode, where
> > > > Hudi writing and compaction are amanged by a single long running
> spark
> > > > application.
> > > >
> > > > Would Nifi also help us manage compactions when working with Hudi
> > > > datasource or just writing plain spark Hudi pipelines?
> > > >
> > > > On 2019/09/18 08:18:44, Taher Koitawala  wrote:
> > > > > That's another way of doing things. I want to know if someone wrote
> > > > > something like PutParquet. Which directly can write data to Hudi.
> > > AFAIK I
> > > > > don't think anyone has.
> > > > >
> > > > > That will really be powerful.
> > > > >
> > > > > On Wed, Sep 18, 2019, 1:37 PM Pratyaksh Sharma <
> > pratyaks...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Taher,
> > > > > >
> > > > > > In the initial phase of our CDC pipeline, we were using Hudi with
> > > Nifi.
> > > > > > Nifi was being used to read Binlog file of mysql and to push that
> > > data
> > > > to
> > > > > > some Kafka topic. This topic was then getting consumed by
> > > > DeltaStreamer. So
> > > > > > Nifi was indirectly involved in that flow.
> > > > > >
> > > > > > On Wed, Sep 18, 2019 at 10:29 AM Taher Koitawala <
> > taher...@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >   Just wanted to know has anyone tried to write data to
> > > Hudi
> > > > > > with a
> > > > > > > Nifi flow?
> > > > > > >
> > > > > > > Perhaps may be just a csv file on local to Hudi dataset? If not
> > > then
> > > > lets
> > > > > > > try that!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Taher Koitawala
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
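As a rough model of the "foreach the micro-batch, then write through the client" idea floated above, the sketch below treats micro-batches as plain lists and uses a made-up client class. None of these names are actual Hudi APIs; the real flow would go through Spark's foreachRDD and Hudi's write-client API.

```python
# Sketch of the micro-batch write idea: one micro-batch becomes one commit.
# HypotheticalWriteClient is a stand-in, not a real Hudi class.

class HypotheticalWriteClient:
    def __init__(self):
        self.commits = []

    def start_commit(self):
        # Hand out a monotonically increasing commit time.
        return f"commit-{len(self.commits):03d}"

    def upsert(self, records, commit_time):
        # Record the batch under its commit, as streaming ingest would.
        self.commits.append((commit_time, list(records)))
        return len(records)


client = HypotheticalWriteClient()
micro_batches = [[{"key": "a"}, {"key": "b"}], [{"key": "c"}]]
for batch in micro_batches:           # stands in for dstream.foreachRDD(...)
    t = client.start_commit()
    client.upsert(batch, t)

assert len(client.commits) == 2       # two micro-batches -> two commits
```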


Re: Field not found in record HoodieException

2019-09-24 Thread Kabeer Ahmed
Taher,

Sorry, I got a bit delayed. I have now put everything you may need in a gist:
https://gist.github.com/smdahmed/3af0e3110e07cf76772bb73d5e9b65e2

Note that I am still on 0.4.6, so you may need to swap com.uber.hoodie with the
right org.apache.hudi package, etc. I am also still on the RDD-based
implementation, but I can assure you that if you swap the code for a
DataFrame-based implementation it will still work the same. If you are looking
for a DataFrame-based implementation, see the code sample at:
https://github.com/apache/incubator-hudi/issues/859#issuecomment-527316262

The gist contains the following:

Code sample to generate parquet

Hive table creation and addition of partitions

Spark-shell based code that is in line with what you had needed

If you want any changes to be made, please do not hesitate; I can modify the
code and spin up tests for you. I can assure you that this will work and, to
the best of my belief, this is what you had aimed to achieve.

Thanks,
Kabeer.
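The root-cause advice elsewhere in this thread was to check for null precombine keys and for illegal characters in column names (Avro names may contain only letters, digits, and underscores, and may not start with a digit). A small illustrative pre-write check, with hypothetical function names, might look like:

```python
import re

# Illustrative pre-write validation, not a Hudi API: flag rows whose
# precombine field is null and column names that are not legal Avro names.
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def find_bad_rows(rows, precombine_field):
    # Indices of rows where the precombine field is missing or null.
    return [i for i, r in enumerate(rows) if r.get(precombine_field) is None]

def find_bad_columns(columns):
    # Column names that would break the generated Avro schema.
    return [c for c in columns if not AVRO_NAME.match(c)]


rows = [{"contact_id": 1, "last_update": "2019-09-16"},
        {"contact_id": 2, "last_update": None}]
assert find_bad_rows(rows, "last_update") == [1]
assert find_bad_columns(["contact_id", "last/update", "country"]) == ["last/update"]
```

Running such a check before the upsert would surface the null precombine values and stray characters (like a standalone "/" or ".") that this thread identified as the likely trigger of the HoodieException.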

On Sep 18 2019, at 5:13 pm, Taher Koitawala  wrote:
> Hi Kabeer,
> Really appreciate the help. Take your time nothing urgent.
>
> Regards,
> Taher Koitawala
>
> On Wed, Sep 18, 2019, 9:38 PM Kabeer Ahmed  wrote:
> > Taher,
> > I have a half baked code for test. I shall complete it and test it and
> > revert back to you - latest by weekend. Please bear with me. If it is super
> > urgent or you are really stuck, then let me know.
> > Thanks,
> > On Sep 18 2019, at 7:27 am, Gary Li  wrote:
> > > I think we can also try to find if there is any illegal character that
> > > could mess up Avro scheme in the column. Like a stand alone “/“ or “.”
> > >
> > > On Tue, Sep 17, 2019 at 8:35 PM Vinoth Chandar 
> > wrote:
> > > > [Orthogonal comment] It's so awesome to see us troubleshooting
> > >
> >
> > together..
> > > > Thanks everyone on this thread!
> > > >
> > > > On Tue, Sep 17, 2019 at 8:04 PM Taher Koitawala 
> > > > wrote:
> > > >
> > > > > No there are no nulls in the data and I am getting the same error.
> > > > > On Wed, Sep 18, 2019, 3:33 AM Kabeer Ahmed 
> > > >
> > >
> >
> > wrote:
> > > > > > Taher - did you find any NULLs in the data? If you are still not
> > > > >
> > > >
> > >
> >
> > able
> > > > to
> > > > > > make progress, let us know.
> > > > > >
> > > > > > On Sep 17 2019, at 8:30 am, Taher Koitawala 
> > > > wrote:
> > > > > > > Sure Gary, Let me check if i can find any nulls in there
> > > > > > >
> > > > > > > On Tue, Sep 17, 2019 at 1:28 AM Gary Li <
> > yanjia.gary...@gmail.com>
> > > > > > wrote:
> > > > > > > > Hello, I have seen this exception before. In my case, if the
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > precombine key
> > > > > > > > of one entry is null, then I will have this error. I'd
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > recommend
> > > > > > >
> > > > > >
> > > > > > checking
> > > > > > > > if there is any row has null in *last_update.*
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Gary
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Sep 16, 2019 at 12:32 PM Kabeer Ahmed <
> > > > kab...@linuxmail.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Taher,
> > > > > > > > > Let me spin a test for you to test similar scenario and let
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > me
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > revert
> > > > > > > > back
> > > > > > > > > to you.
> > > > > > > > > On Sep 16 2019, at 2:09 pm, Taher Koitawala <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > taher...@gmail.com>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > > wrote:
> > > > > > > > > > Hi Kabeer, hive table has everything as a string. However
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > when
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > > fetching
> > > > > > > > > > data, the spark query is
> > > > > > > > > > .sql(String.format("select
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > contact_id,country,cast(last_update
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > > as
> > > > > > > > > >