Re: [DISCUSS] Hudi is the data lake platform

2021-08-04 Thread Vinoth Chandar
Folks,

I have been digesting some feedback on what we show on the home page itself.

While the blog explains the vision, it might be good to bubble up the sub-areas
that are more relevant to our users today: transactions, updates, deletes.

So, I have raised a PR moving stuff around.

Now we lead with
- "Hudi brings transactions, record-level updates/deletes and change
  streams to data lakes"

and then explain the platform at the next level of detail.

https://github.com/apache/hudi/pull/3406


Re: [DISCUSS] Hudi is the data lake platform

2021-08-02 Thread Vinoth Chandar
Thanks! Will work on it this week.
Also redoing some images based on feedback.


Re: [DISCUSS] Hudi is the data lake platform

2021-07-30 Thread vino yang
+1


Re: [DISCUSS] Hudi is the data lake platform

2021-07-29 Thread Pratyaksh Sharma
Guess we should also rebrand Hudi in the README.md file?
https://github.com/apache/hudi#readme

This page still mentions the following -

"Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
Incrementals. Hudi manages the storage of large analytical datasets on DFS
(Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."


Re: [DISCUSS] Hudi is the data lake platform

2021-07-23 Thread Vinoth Chandar
Thanks Vino! Got a bunch of emoticons on the PR as well.

Will land this Monday, giving it more time over the weekend as well.


Re: [DISCUSS] Hudi is the data lake platform

2021-07-21 Thread vino yang
Thanks vc

Very good blog, in-depth and forward-looking. Learned!

Best,
Vino


Re: [DISCUSS] Hudi is the data lake platform

2021-07-21 Thread Vinoth Chandar
Expanding to users@ as well.

Hi all,

Since this discussion, I started to pen down a coherent strategy and convey
these ideas via a blog post.
I have also done my own research, talked to (ex)colleagues I respect to get
their take and refine it.

Here's a blog that hopefully explains this vision.

https://github.com/apache/hudi/pull/3322

Look forward to your feedback on the PR. We are hoping to land this early
next week, if everyone is aligned.

Thanks
Vinoth

On Wed, Apr 21, 2021 at 9:01 PM wei li  wrote:

> +1, cannot agree more.
> *aux metadata* and metatable can give Hudi large performance
> optimizations on the query side.
> We can keep developing these.
> A cache service may be a necessary component in cloud-native environments.
>
> On 2021/04/13 05:29:55, Vinoth Chandar  wrote:
> > Hello all,
> >
> > Reading one more article today positioning Hudi as just a table format
> > made me wonder if we have done enough justice in explaining what we have
> > built together here.
> > I tend to think of Hudi as the data lake platform, which has the following
> > components, of which one is a table format and one is a transactional
> > storage layer.
> > But the whole stack we have is definitely worth more than the sum of all
> > the parts IMO (speaking from my own experience from the past 10+ years of
> > open source software dev).
> >
> > Here's what we have built so far.
> >
> > a) *table format*: something that stores the table schema, plus a metadata
> > table that stores file listings today and is being extended to store column
> > ranges and more in the future (RFC-27)
> > b) *aux metadata*: bloom filters and external record-level indexes today;
> > bitmaps/interval trees and other advanced on-disk data structures tomorrow
> > c) *concurrency control*: we have always supported MVCC-based, log-based
> > concurrency (serializing writes into a time-ordered log), and we now also
> > have OCC for batch merge workloads with 0.8.0. We will have multi-table and
> > fully non-blocking writers soon (see the future work section of RFC-22)
> > d) *updates/deletes*: this is the bread-and-butter use case for Hudi, but
> > we also support primary/unique key constraints, and we could add foreign
> > keys as an extension once our transactions can span tables.
> > e) *table services*: a Hudi pipeline today is self-managing: it sizes files,
> > cleans, compacts, clusters data, and bootstraps existing data, with all
> > these actions working off each other without blocking one another (for the
> > most part).
> > f) *data services*: we also have higher-level functionality with
> > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > coming, ...and more), incremental ETL support, de-duplication, and commit
> > callbacks; pre-commit validations are coming, and error tables have been
> > proposed. I could also envision us building towards streaming egress and
> > data monitoring.
> >
> > I also think we should build the following (subject to separate
> > DISCUSS/RFCs)
> >
> > g) *caching service*: Hudi specific caching service that can hold mutable
> > data and serve oft-queried data across engines.
> > h) *timeline metaserver:* We already run a metaserver in spark
> > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> turn
> > it into a scalable, sharded metastore, that all engines can use to obtain
> > any metadata.
> >
> > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to
> > "ingests & manages storage of large analytical datasets over DFS (hdfs or
> > cloud stores)." and convey the scope of our vision,
> > given we have already been building towards that. It would also provide
> new
> > contributors a good lens to look at the project from.
> >
> > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > system, to an event streaming platform - with addition of
> > MirrorMaker/Connect etc. )
> >
> > Please share your thoughts!
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Hudi is the data lake platform

2021-04-21 Thread wei li
+1, cannot agree more.
*aux metadata* and the metadata table can give Hudi large performance
optimizations on the query end.
We can continue to develop them.
A cache service may be a necessary component in a cloud-native environment.

On 2021/04/13 05:29:55, Vinoth Chandar  wrote: 
> Hello all,
> 
> Reading one more article today, positioning Hudi, as just a table format,
> made me wonder, if we have done enough justice in explaining what we have
> built together here.
> I tend to think of Hudi as the data lake platform, which has the following
> components, of which - one is a table format, one is a transactional
> storage layer.
> But the whole stack we have is definitely worth more than the sum of all
> the parts IMO (speaking from my own experience from the past 10+ years of
> open source software dev).
> 
> Here's what we have built so far.
> 
> a) *table format* : something that stores table schema, a metadata table
> that stores file listing today, and being extended to store column ranges
> and more in the future (RFC-27)
> b) *aux metadata* : bloom filters, external record level indexes today,
> bitmaps/interval trees and other advanced on-disk data structures tomorrow
> c) *concurrency control* : we always supported MVCC based log based
> concurrency (serialize writes into a time ordered log), and we now also
> have OCC for batch merge workloads with 0.8.0. We will have multi-table and
> fully non-blocking writers soon (see future work section of RFC-22)
> d) *updates/deletes* : this is the bread-and-butter use-case for Hudi, but
> we support primary/unique key constraints and we could add foreign keys as
> an extension, once our transactions can span tables.
> e) *table services*: a hudi pipeline today is self-managing - sizes files,
> cleans, compacts, clusters data, bootstraps existing data - all these
> actions working off each other without blocking one another. (for most
> parts).
> f) *data services*: we also have higher level functionality with
> deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> coming, ...and more), incremental ETL support, de-duplication, commit
> callbacks, pre-commit validations are coming, error tables have been
> proposed. I could also envision us building towards streaming egress, data
> monitoring.
> 
> I also think we should build the following (subject to separate
> DISCUSS/RFCs)
> 
> g) *caching service*: Hudi specific caching service that can hold mutable
> data and serve oft-queried data across engines.
> h) *timeline metaserver:* We already run a metaserver in spark
> writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's turn
> it into a scalable, sharded metastore, that all engines can use to obtain
> any metadata.
> 
> To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to
> "ingests & manages storage of large analytical datasets over DFS (hdfs or
> cloud stores)." and convey the scope of our vision,
> given we have already been building towards that. It would also provide new
> contributors a good lens to look at the project from.
> 
> (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> system, to an event streaming platform - with addition of
> MirrorMaker/Connect etc. )
> 
> Please share your thoughts!
> 
> Thanks
> Vinoth
> 
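
Item (c) in the quoted proposal describes serializing writes into a
time-ordered log and adding OCC for batch workloads. A minimal sketch of that
idea follows, as a toy in-memory model. It is not Hudi's implementation; the
names Timeline, Commit and try_commit, and the use of file names as the
conflict granularity, are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class Commit:
    instant: int            # position in the time-ordered log
    files: FrozenSet[str]   # files this commit wrote (hypothetical granularity)


@dataclass
class Timeline:
    commits: List[Commit] = field(default_factory=list)

    def latest_instant(self) -> int:
        # 0 means "no commits yet"
        return self.commits[-1].instant if self.commits else 0

    def try_commit(self, began_at: int, files: FrozenSet[str]) -> bool:
        """Optimistic check: reject the write if any commit that landed
        after this writer started touched an overlapping file."""
        for c in self.commits:
            if c.instant > began_at and c.files & files:
                return False
        self.commits.append(Commit(self.latest_instant() + 1, files))
        return True


timeline = Timeline()
start_a = timeline.latest_instant()   # writer A starts
start_b = timeline.latest_instant()   # writer B starts concurrently

ok_a = timeline.try_commit(start_a, frozenset({"f1", "f2"}))  # A commits first
ok_b = timeline.try_commit(start_b, frozenset({"f2"}))        # overlaps with A
ok_c = timeline.try_commit(start_b, frozenset({"f3"}))        # disjoint from A
print(ok_a, ok_b, ok_c)
```

In this toy model the overlapping writer is rejected and would have to retry,
while the writer touching disjoint files commits without blocking.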


Re: [DISCUSS] Hudi is the data lake platform

2021-04-19 Thread Vinoth Chandar
Looks like we have consensus here!  Will share the blog PR here once ready.

Thanks all!

On Fri, Apr 16, 2021 at 8:43 PM Sivabalan  wrote:

> totally +1 on clarifying Hudi's vision.
>
> On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal 
> wrote:
>
> > +1
> >
> > I also believe Hudi is a Data Platform technology providing many
> different
> > functionalities to build modern data lakes, Hudi's table format being
> just
> > one of them. I've been using this perspective in some of the conference
> > talks already ;)
> > With this rebranding (and hopefully some code/package structuring down
> the
> > road..), it's easier for us to communicate the value add of Hudi and its
> > associated features and generate interest for future contributors.
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar 
> wrote:
> >
> > > Thanks everyone for the feedback, so far!
> > >
> > > On the incremental aspects, that's actually Hudi's core design
> > > differentiation. While I believe the ETL today is still largely batch
> > > oriented, the way forward for everyone's
> > > benefit is indeed - incremental processing. We have already taken a
> giant
> > > step here for e.g in making raw data ingestion fully incremental using
> > > deltastreamer. We should keep working to crack incremental ETL at
> large.
> > > 100% with your line of thinking!
> > >
> > > It's been in my head for four full years now! :)
> > >
> > >
> >
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
> > >
> > > I have started drafting a blog/PR along these lines already. I will
> make
> > it
> > > more final and share here, as we wait couple more days for more
> feedback!
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Apr 13, 2021 at 7:01 PM Danny Chan 
> wrote:
> > >
> > > > +1 for the vision, personally i'm promising the incremental ETL part,
> > > with
> > > > engine like Apache Flink we can do intermediate aggregation in
> > streaming
> > > > style.
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > > > > On Wed, Apr 14, 2021 at 9:52 AM leesf wrote:
> > > >
> > > > > +1. Cool and promising.
> > > > >
> > > > > > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit wrote:
> > > > >
> > > > > > Agree with the rebranding Vinoth. Hudi is not just a "table
> format"
> > > and
> > > > > we
> > > > > > need to do justice to all the cool auxiliary features/services we
> > > have
> > > > > > built.
> > > > > >
> > > > > > Also, timeline metadata service in particular would be a really
> big
> > > win
> > > > > if
> > > > > > we move towards something like that.
> > > > > >
> > > > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma"  >
> > > > wrote:
> > > > > >
> > > > > > Definitely we are doing much more than only ingesting and
> > > managing
> > > > > data
> > > > > > over DFS.
> > > > > >
> > > > > > +1 from my side as well. :)
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <
> > susudo...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I love this rebranding. Totally agree. +1
> > > > > > >
> > > > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > > > > > xu.shiyan.raym...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 The vision looks fantastic.
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <
> gar...@apache.org
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Awesome summary of Hudi! +1 as well.
> > > > > > > > >
> > > > > > > > > Gary Li
> > > > > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> > > > > > rubenssoto2...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > Excellent, I agree
> > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 Excited by this new vision!
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Vino
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1  The new brand is straightforward, a better
> > > > > description
> > > > > > of
> > > > > > > Hudi.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Dianjin Wang
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > > > > > > bhavanisud...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1 . Cannot agree more. I think this makes
> tota

Re: [DISCUSS] Hudi is the data lake platform

2021-04-16 Thread Sivabalan
totally +1 on clarifying Hudi's vision.

On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal  wrote:

> +1
>
> I also believe Hudi is a Data Platform technology providing many different
> functionalities to build modern data lakes, Hudi's table format being just
> one of them. I've been using this perspective in some of the conference
> talks already ;)
> With this rebranding (and hopefully some code/package structuring down the
> road..), it's easier for us to communicate the value add of Hudi and its
> associated features and generate interest for future contributors.
>
> Thanks,
> Nishith
>
>
> On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar  wrote:
>
> > Thanks everyone for the feedback, so far!
> >
> > On the incremental aspects, that's actually Hudi's core design
> > differentiation. While I believe the ETL today is still largely batch
> > oriented, the way forward for everyone's
> > benefit is indeed - incremental processing. We have already taken a giant
> > step here for e.g in making raw data ingestion fully incremental using
> > deltastreamer. We should keep working to crack incremental ETL at large.
> > 100% with your line of thinking!
> >
> > It's been in my head for four full years now! :)
> >
> >
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
> >
> > I have started drafting a blog/PR along these lines already. I will make
> it
> > more final and share here, as we wait couple more days for more feedback!
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Apr 13, 2021 at 7:01 PM Danny Chan  wrote:
> >
> > > +1 for the vision, personally i'm promising the incremental ETL part,
> > with
> > > engine like Apache Flink we can do intermediate aggregation in
> streaming
> > > style.
> > >
> > > Best,
> > > Danny Chan
> > >
> > > > On Wed, Apr 14, 2021 at 9:52 AM leesf wrote:
> > >
> > > > +1. Cool and promising.
> > > >
> > > > > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit wrote:
> > > >
> > > > > Agree with the rebranding Vinoth. Hudi is not just a "table format"
> > and
> > > > we
> > > > > need to do justice to all the cool auxiliary features/services we
> > have
> > > > > built.
> > > > >
> > > > > Also, timeline metadata service in particular would be a really big
> > win
> > > > if
> > > > > we move towards something like that.
> > > > >
> > > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" 
> > > wrote:
> > > > >
> > > > > Definitely we are doing much more than only ingesting and
> > managing
> > > > data
> > > > > over DFS.
> > > > >
> > > > > +1 from my side as well. :)
> > > > >
> > > > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <
> susudo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I love this rebranding. Totally agree. +1
> > > > > >
> > > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > > > > xu.shiyan.raym...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 The vision looks fantastic.
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li  >
> > > > wrote:
> > > > > > >
> > > > > > > > Awesome summary of Hudi! +1 as well.
> > > > > > > >
> > > > > > > > Gary Li
> > > > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> > > > > rubenssoto2...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > Excellent, I agree
> > > > > > > > >
> > > > > > > > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > +1 Excited by this new vision!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Vino
> > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
> > > > > > > > > >
> > > > > > > > > > > +1  The new brand is straightforward, a better
> > > > description
> > > > > of
> > > > > > Hudi.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Dianjin Wang
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > > > > > bhavanisud...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 . Cannot agree more. I think this makes total
> > > sense
> > > > > and will
> > > > > > > > provide
> > > > > > > > > > > for
> > > > > > > > > > > > a much better representation of the project.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > > > > > > vin...@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hello all,
> > > > > > > > > > > > >
> > > > > > 

Re: [DISCUSS] Hudi is the data lake platform

2021-04-14 Thread nishith agarwal
+1

I also believe Hudi is a Data Platform technology providing many different
functionalities to build modern data lakes, Hudi's table format being just
one of them. I've been using this perspective in some of the conference
talks already ;)
With this rebranding (and hopefully some code/package structuring down the
road..), it's easier for us to communicate the value add of Hudi and its
associated features and generate interest for future contributors.

Thanks,
Nishith


On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar  wrote:

> Thanks everyone for the feedback, so far!
>
> On the incremental aspects, that's actually Hudi's core design
> differentiation. While I believe the ETL today is still largely batch
> oriented, the way forward for everyone's
> benefit is indeed - incremental processing. We have already taken a giant
> step here for e.g in making raw data ingestion fully incremental using
> deltastreamer. We should keep working to crack incremental ETL at large.
> 100% with your line of thinking!
>
> It's been in my head for four full years now! :)
>
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
>
> I have started drafting a blog/PR along these lines already. I will make it
> more final and share here, as we wait couple more days for more feedback!
>
> Thanks
> Vinoth
>
> On Tue, Apr 13, 2021 at 7:01 PM Danny Chan  wrote:
>
> > +1 for the vision, personally i'm promising the incremental ETL part,
> with
> > engine like Apache Flink we can do intermediate aggregation in streaming
> > style.
> >
> > Best,
> > Danny Chan
> >
> > On Wed, Apr 14, 2021 at 9:52 AM leesf wrote:
> >
> > > +1. Cool and promising.
> > >
> > > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit wrote:
> > >
> > > > Agree with the rebranding Vinoth. Hudi is not just a "table format"
> and
> > > we
> > > > need to do justice to all the cool auxiliary features/services we
> have
> > > > built.
> > > >
> > > > Also, timeline metadata service in particular would be a really big
> win
> > > if
> > > > we move towards something like that.
> > > >
> > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" 
> > wrote:
> > > >
> > > > Definitely we are doing much more than only ingesting and
> managing
> > > data
> > > > over DFS.
> > > >
> > > > +1 from my side as well. :)
> > > >
> > > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong 
> > > > wrote:
> > > >
> > > > > I love this rebranding. Totally agree. +1
> > > > >
> > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > > > xu.shiyan.raym...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1 The vision looks fantastic.
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li 
> > > wrote:
> > > > > >
> > > > > > > Awesome summary of Hudi! +1 as well.
> > > > > > >
> > > > > > > Gary Li
> > > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> > > > rubenssoto2...@gmail.com>
> > > > > > > wrote:
> > > > > > > > Excellent, I agree
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > +1 Excited by this new vision!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Vino
> > > > > > > > >
> > > > > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
> > > > > > > > >
> > > > > > > > > > +1  The new brand is straightforward, a better
> > > description
> > > > of
> > > > > Hudi.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Dianjin Wang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > > > > bhavanisud...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 . Cannot agree more. I think this makes total
> > sense
> > > > and will
> > > > > > > provide
> > > > > > > > > > for
> > > > > > > > > > > a much better representation of the project.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > > > > > vin...@apache.org
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello all,
> > > > > > > > > > > >
> > > > > > > > > > > > Reading one more article today, positioning Hudi,
> > as
> > > > just a
> > > > > > table
> > > > > > > > > > format,
> > > > > > > > > > > > made me wonder, if we have done enough justice in
> > > > explaining
> > > > > > > what we
> > > > > > > > > > have
> > > > > > > > > > > > built together here.
> > > > > > > > > > > > I tend to think of Hudi as the data lake
> platfor

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Vinoth Chandar
Thanks everyone for the feedback, so far!

On the incremental aspects, that's actually Hudi's core design
differentiation. While I believe ETL today is still largely batch
oriented, the way forward for everyone's benefit is indeed incremental
processing. We have already taken a giant step here, e.g., in making raw
data ingestion fully incremental using deltastreamer. We should keep working
to crack incremental ETL at large.
100% with your line of thinking!

It's been in my head for four full years now! :)
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/

I have started drafting a blog/PR along these lines already. I will make it
more final and share it here, as we wait a couple more days for more feedback!

Thanks
Vinoth
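
The incremental processing idea in this message can be sketched as a
checkpointed pull: a downstream job reads only the records committed after the
point it last processed, instead of rescanning the whole table. This is a toy
in-memory model under stated assumptions, not deltastreamer's API; the function
incremental_pull, the integer commit times and the record layout are all
hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def incremental_pull(
    commits: List[Tuple[int, List[Record]]], checkpoint: int
) -> Tuple[List[Record], int]:
    """Return the records from commits newer than `checkpoint`,
    together with the new checkpoint to store for the next run."""
    new_records: List[Record] = []
    latest = checkpoint
    for commit_time, records in sorted(commits, key=lambda c: c[0]):
        if commit_time > checkpoint:
            new_records.extend(records)
            latest = max(latest, commit_time)
    return new_records, latest


# Hypothetical commit log: (commit time, records written by that commit)
commits = [
    (1, [{"id": "a", "v": 1}]),
    (2, [{"id": "b", "v": 1}]),
    (3, [{"id": "a", "v": 2}]),
]

batch1, cp1 = incremental_pull(commits, checkpoint=0)    # first run reads everything
batch2, cp2 = incremental_pull(commits, checkpoint=cp1)  # nothing new since cp1

commits.append((4, [{"id": "c", "v": 1}]))
batch3, cp3 = incremental_pull(commits, checkpoint=cp2)  # only the new commit
print(len(batch1), len(batch2), len(batch3), cp3)
```

The point of the sketch is that the work done per run scales with the amount of
new data rather than with the table size, which is what separates incremental
ETL from batch recomputation.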

On Tue, Apr 13, 2021 at 7:01 PM Danny Chan  wrote:

> +1 for the vision, personally i'm promising the incremental ETL part, with
> engine like Apache Flink we can do intermediate aggregation in streaming
> style.
>
> Best,
> Danny Chan
>
> On Wed, Apr 14, 2021 at 9:52 AM leesf wrote:
>
> > +1. Cool and promising.
> >
> > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit wrote:
> >
> > > Agree with the rebranding Vinoth. Hudi is not just a "table format" and
> > we
> > > need to do justice to all the cool auxiliary features/services we have
> > > built.
> > >
> > > Also, timeline metadata service in particular would be a really big win
> > if
> > > we move towards something like that.
> > >
> > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" 
> wrote:
> > >
> > > Definitely we are doing much more than only ingesting and managing
> > data
> > > over DFS.
> > >
> > > +1 from my side as well. :)
> > >
> > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong 
> > > wrote:
> > >
> > > > I love this rebranding. Totally agree. +1
> > > >
> > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > > xu.shiyan.raym...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1 The vision looks fantastic.
> > > > >
> > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li 
> > wrote:
> > > > >
> > > > > > Awesome summary of Hudi! +1 as well.
> > > > > >
> > > > > > Gary Li
> > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> > > rubenssoto2...@gmail.com>
> > > > > > wrote:
> > > > > > > Excellent, I agree
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > > > >
> > > > > > > > +1 Excited by this new vision!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Vino
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
> > > > > > > >
> > > > > > > > > +1  The new brand is straightforward, a better
> > description
> > > of
> > > > Hudi.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Dianjin Wang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > > > bhavanisud...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 . Cannot agree more. I think this makes total
> sense
> > > and will
> > > > > > provide
> > > > > > > > > for
> > > > > > > > > > a much better representation of the project.
> > > > > > > > > >
> > > > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > > > > vin...@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello all,
> > > > > > > > > > >
> > > > > > > > > > > Reading one more article today, positioning Hudi,
> as
> > > just a
> > > > > table
> > > > > > > > > format,
> > > > > > > > > > > made me wonder, if we have done enough justice in
> > > explaining
> > > > > > what we
> > > > > > > > > have
> > > > > > > > > > > built together here.
> > > > > > > > > > > I tend to think of Hudi as the data lake platform,
> > > which has
> > > > > the
> > > > > > > > > > following
> > > > > > > > > > > components, of which - one if a table format, one
> is
> > a
> > > > > > transactional
> > > > > > > > > > > storage layer.
> > > > > > > > > > > But the whole stack we have is definitely worth
> more
> > > than the
> > > > > > sum of
> > > > > > > > > all
> > > > > > > > > > > the parts IMO (speaking from my own experience from
> > > the past
> > > > > 10+
> > > > > > > > years
> > > > > > > > > of
> > > > > > > > > > > open source software dev).
> > > > > > > > > > >
> > > > > > > > > > > Here's what we have built so far.
> > > > > > > > > > >
> > > > > > > > > > > a) *table format* : something that stores table
> > > schema, a
> > > > > > 

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Danny Chan
+1 for the vision. Personally, I find the incremental ETL part promising; with
an engine like Apache Flink we can do intermediate aggregation in a streaming
style.

Best,
Danny Chan

On Wed, Apr 14, 2021 at 9:52 AM leesf wrote:

> +1. Cool and promising.
>
> On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit wrote:
>
> > Agree with the rebranding Vinoth. Hudi is not just a "table format" and
> we
> > need to do justice to all the cool auxiliary features/services we have
> > built.
> >
> > Also, timeline metadata service in particular would be a really big win
> if
> > we move towards something like that.
> >
> > On 4/13/21, 11:01 AM, "Pratyaksh Sharma"  wrote:
> >
> > Definitely we are doing much more than only ingesting and managing
> data
> > over DFS.
> >
> > +1 from my side as well. :)
> >
> > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong 
> > wrote:
> >
> > > I love this rebranding. Totally agree. +1
> > >
> > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > xu.shiyan.raym...@gmail.com>
> > > wrote:
> > >
> > > > +1 The vision looks fantastic.
> > > >
> > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li 
> wrote:
> > > >
> > > > > Awesome summary of Hudi! +1 as well.
> > > > >
> > > > > Gary Li
> > > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> > rubenssoto2...@gmail.com>
> > > > > wrote:
> > > > > > Excellent, I agree
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > > >
> > > > > > > +1 Excited by this new vision!
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
> > > > > > >
> > > > > > > > +1  The new brand is straightforward, a better
> description
> > of
> > > Hudi.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Dianjin Wang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > > bhavanisud...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 . Cannot agree more. I think this makes total sense
> > and will
> > > > > provide
> > > > > > > > for
> > > > > > > > > a much better representation of the project.
> > > > > > > > >
> > > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > > > vin...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello all,
> > > > > > > > > >
> > > > > > > > > > Reading one more article today, positioning Hudi, as
> > just a
> > > > table
> > > > > > > > format,
> > > > > > > > > > made me wonder, if we have done enough justice in
> > explaining
> > > > > what we
> > > > > > > > have
> > > > > > > > > > built together here.
> > > > > > > > > > I tend to think of Hudi as the data lake platform,
> > which has
> > > > the
> > > > > > > > > following
> > > > > > > > > > components, of which - one if a table format, one is
> a
> > > > > transactional
> > > > > > > > > > storage layer.
> > > > > > > > > > But the whole stack we have is definitely worth more
> > than the
> > > > > sum of
> > > > > > > > all
> > > > > > > > > > the parts IMO (speaking from my own experience from
> > the past
> > > > 10+
> > > > > > > years
> > > > > > > > of
> > > > > > > > > > open source software dev).
> > > > > > > > > >
> > > > > > > > > > Here's what we have built so far.
> > > > > > > > > >
> > > > > > > > > > a) *table format* : something that stores table
> > schema, a
> > > > > metadata
> > > > > > > > table
> > > > > > > > > > that stores file listing today, and being extended to
> > store
> > > > > column
> > > > > > > > ranges
> > > > > > > > > > and more in the future (RFC-27)
> > > > > > > > > > b) *aux metadata* : bloom filters, external record
> > level
> > > > indexes
> > > > > > > today,
> > > > > > > > > > bitmaps/interval trees and other advanced on-disk
> data
> > > > structures
> > > > > > > > > tomorrow
> > > > > > > > > > c) *concurrency control* : we always supported MVCC
> > based log
> > > > > based
> > > > > > > > > > concurrency (serialize writes into a time ordered
> > log), and
> > > we
> > > > > now
> > > > > > > also
> > > > > > > > > > have OCC for batch merge workloads with 0.8.0. We
> will
> > have
> > > > > > > multi-table
> > > > > > > > > and
> > > > > > > > > > fully non-blocking writers soon (see future work
> > section of
> > > > > RFC-22)
> > > > > > > > > > d) *updates/deletes* : this is the bread-and-butter
> > use-cas

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread leesf
+1. Cool and promising.

On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit wrote:

> Agree with the rebranding Vinoth. Hudi is not just a "table format" and we
> need to do justice to all the cool auxiliary features/services we have
> built.
>
> Also, timeline metadata service in particular would be a really big win if
> we move towards something like that.
>
> On 4/13/21, 11:01 AM, "Pratyaksh Sharma"  wrote:
>
> Definitely we are doing much more than only ingesting and managing data
> over DFS.
>
> +1 from my side as well. :)
>
> On Tue, Apr 13, 2021 at 10:02 PM Susu Dong 
> wrote:
>
> > I love this rebranding. Totally agree. +1
> >
> > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> xu.shiyan.raym...@gmail.com>
> > wrote:
> >
> > > +1 The vision looks fantastic.
> > >
> > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li  wrote:
> > >
> > > > Awesome summary of Hudi! +1 as well.
> > > >
> > > > Gary Li
> > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> rubenssoto2...@gmail.com>
> > > > wrote:
> > > > > Excellent, I agree
> > > > >
> > > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang <yanghua1...@gmail.com> wrote:
> > > > >
> > > > > > +1 Excited by this new vision!
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
> > > > > >
> > > > > > > +1  The new brand is straightforward, a better description
> of
> > Hudi.
> > > > > > >
> > > > > > > Best,
> > > > > > > Dianjin Wang
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > bhavanisud...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 . Cannot agree more. I think this makes total sense
> and will
> > > > provide
> > > > > > > for
> > > > > > > > a much better representation of the project.
> > > > > > > >
> > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > > vin...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hello all,
> > > > > > > > >
> > > > > > > > > Reading one more article today, positioning Hudi, as
> just a
> > > table
> > > > > > > format,
> > > > > > > > > made me wonder, if we have done enough justice in
> explaining
> > > > what we
> > > > > > > have
> > > > > > > > > built together here.
> > > > > > > > > I tend to think of Hudi as the data lake platform,
> which has
> > > the
> > > > > > > > following
> > > > > > > > > components, of which - one if a table format, one is a
> > > > transactional
> > > > > > > > > storage layer.
> > > > > > > > > But the whole stack we have is definitely worth more
> than the
> > > > sum of
> > > > > > > all
> > > > > > > > > the parts IMO (speaking from my own experience from
> the past
> > > 10+
> > > > > > years
> > > > > > > of
> > > > > > > > > open source software dev).
> > > > > > > > >
> > > > > > > > > Here's what we have built so far.
> > > > > > > > >
> > > > > > > > > a) *table format* : something that stores table
> schema, a
> > > > metadata
> > > > > > > table
> > > > > > > > > that stores file listing today, and being extended to
> store
> > > > column
> > > > > > > ranges
> > > > > > > > > and more in the future (RFC-27)
> > > > > > > > > b) *aux metadata* : bloom filters, external record
> level
> > > indexes
> > > > > > today,
> > > > > > > > > bitmaps/interval trees and other advanced on-disk data
> > > structures
> > > > > > > > tomorrow
> > > > > > > > > c) *concurrency control* : we always supported MVCC
> based log
> > > > based
> > > > > > > > > concurrency (serialize writes into a time ordered
> log), and
> > we
> > > > now
> > > > > > also
> > > > > > > > > have OCC for batch merge workloads with 0.8.0. We will
> have
> > > > > > multi-table
> > > > > > > > and
> > > > > > > > > fully non-blocking writers soon (see future work
> section of
> > > > RFC-22)
> > > > > > > > > d) *updates/deletes* : this is the bread-and-butter
> use-case
> > > for
> > > > > > Hudi,
> > > > > > > > but
> > > > > > > > > we support primary/unique key constraints and we could
> add
> > > > foreign
> > > > > > keys
> > > > > > > > as
> > > > > > > > > an extension, once our transactions can span tables.
> > > > > > > > > e) *table services*: a hudi pipeline today is
> self-managing -
> > > > sizes
> > > > > > > > files,
> > > > > > > > > cleans, compacts, clusters data, bootstraps existing
> data -
> > all

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Mehrotra, Udit
Agree with the rebranding, Vinoth. Hudi is not just a "table format", and we
need to do justice to all the cool auxiliary features/services we have built.

Also, the timeline metadata service in particular would be a really big win if
we move towards something like that.

On 4/13/21, 11:01 AM, "Pratyaksh Sharma"  wrote:

Definitely we are doing much more than only ingesting and managing data
over DFS.

+1 from my side as well. :)

On Tue, Apr 13, 2021 at 10:02 PM Susu Dong  wrote:

> I love this rebranding. Totally agree. +1
>
> On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu 
> wrote:
>
> > +1 The vision looks fantastic.
> >
> > On Tue, Apr 13, 2021 at 7:45 AM Gary Li  wrote:
> >
> > > Awesome summary of Hudi! +1 as well.
> > >
> > > Gary Li
> > > On 2021/04/13 14:13:24, Rubens Rodrigues 
> > > wrote:
> > > > Excellent, I agree
> > > >
> > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang
> > > wrote:
> > > >
> > > > > +1 Excited by this new vision!
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang
> > wrote:
> > > > >
> > > > > > +1  The new brand is straightforward, a better description of
> Hudi.
> > > > > >
> > > > > > Best,
> > > > > > Dianjin Wang
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > bhavanisud...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 . Cannot agree more. I think this makes total sense and will
> > > provide
> > > > > > for
> > > > > > > a much better representation of the project.
> > > > > > >
> > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > vin...@apache.org
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > Reading one more article today, positioning Hudi, as just a
> > table
> > > > > > format,
> > > > > > > > made me wonder, if we have done enough justice in explaining
> > > what we
> > > > > > have
> > > > > > > > built together here.
> > > > > > > > I tend to think of Hudi as the data lake platform, which has
> > the
> > > > > > > following
> > > > > > > > components, of which - one is a table format, one is a
> > > transactional
> > > > > > > > storage layer.
> > > > > > > > But the whole stack we have is definitely worth more than the
> > > sum of
> > > > > > all
> > > > > > > > the parts IMO (speaking from my own experience from the past
> > 10+
> > > > > years
> > > > > > of
> > > > > > > > open source software dev).
> > > > > > > >
> > > > > > > > Here's what we have built so far.
> > > > > > > >
> > > > > > > > a) *table format* : something that stores table schema, a
> > > metadata
> > > > > > table
> > > > > > > > that stores file listing today, and being extended to store
> > > column
> > > > > > ranges
> > > > > > > > and more in the future (RFC-27)
> > > > > > > > b) *aux metadata* : bloom filters, external record level
> > indexes
> > > > > today,
> > > > > > > > bitmaps/interval trees and other advanced on-disk data
> > structures
> > > > > > > tomorrow
> > > > > > > > c) *concurrency control* : we always supported MVCC based log
> > > based
> > > > > > > > concurrency (serialize writes into a time ordered log), and
> we
> > > now
> > > > > also
> > > > > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > > > > multi-table
> > > > > > > and
> > > > > > > > fully non-blocking writers soon (see future work section of
> > > RFC-22)
> > > > > > > > d) *updates/deletes* : this is the bread-and-butter use-case
> > for
> > > > > Hudi,
> > > > > > > but
> > > > > > > > we support primary/unique key constraints and we could add
> > > foreign
> > > > > keys
> > > > > > > as
> > > > > > > > an extension, once our transactions can span tables.
> > > > > > > > e) *table services*: a hudi pipeline today is self-managing -
> > > sizes
> > > > > > > files,
> > > > > > > > cleans, compacts, clusters data, bootstraps existing data -
> all
> > > these
> > > > > > > > actions working off each other without blocking one another.
> > (for
> > > > > most
> > > > > > > > parts).
> > > > > > > > f) *data services*: we also have higher level functionality
> > with
> > > > > > > > deltastreamer sources (scalable DFS listing source, Kafka,
> > > Pulsar is
> > > > > > > > coming, ...and more), incremental ETL support,
> de-duplication,
   

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Pratyaksh Sharma
Definitely we are doing much more than only ingesting and managing data
over DFS.

+1 from my side as well. :)

On Tue, Apr 13, 2021 at 10:02 PM Susu Dong  wrote:

> I love this rebranding. Totally agree. +1
>
> On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu 
> wrote:
>
> > +1 The vision looks fantastic.
> >
> > On Tue, Apr 13, 2021 at 7:45 AM Gary Li  wrote:
> >
> > > Awesome summary of Hudi! +1 as well.
> > >
> > > Gary Li
> > > On 2021/04/13 14:13:24, Rubens Rodrigues 
> > > wrote:
> > > > Excellent, I agree
> > > >
> > > > On Tue, Apr 13, 2021 at 7:23 AM vino yang
> > > wrote:
> > > >
> > > > > +1 Excited by this new vision!
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang
> > wrote:
> > > > >
> > > > > > +1  The new brand is straightforward, a better description of
> Hudi.
> > > > > >
> > > > > > Best,
> > > > > > Dianjin Wang
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > bhavanisud...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 . Cannot agree more. I think this makes total sense and will
> > > provide
> > > > > > for
> > > > > > > a much better representation of the project.
> > > > > > >
> > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> > vin...@apache.org
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > Reading one more article today, positioning Hudi, as just a
> > table
> > > > > > format,
> > > > > > > > made me wonder, if we have done enough justice in explaining
> > > what we
> > > > > > have
> > > > > > > > built together here.
> > > > > > > > I tend to think of Hudi as the data lake platform, which has
> > the
> > > > > > > following
> > > > > > > > components, of which - one is a table format, one is a
> > > transactional
> > > > > > > > storage layer.
> > > > > > > > But the whole stack we have is definitely worth more than the
> > > sum of
> > > > > > all
> > > > > > > > the parts IMO (speaking from my own experience from the past
> > 10+
> > > > > years
> > > > > > of
> > > > > > > > open source software dev).
> > > > > > > >
> > > > > > > > Here's what we have built so far.
> > > > > > > >
> > > > > > > > a) *table format* : something that stores table schema, a
> > > metadata
> > > > > > table
> > > > > > > > that stores file listing today, and being extended to store
> > > column
> > > > > > ranges
> > > > > > > > and more in the future (RFC-27)
> > > > > > > > b) *aux metadata* : bloom filters, external record level
> > indexes
> > > > > today,
> > > > > > > > bitmaps/interval trees and other advanced on-disk data
> > structures
> > > > > > > tomorrow
> > > > > > > > c) *concurrency control* : we always supported MVCC based log
> > > based
> > > > > > > > concurrency (serialize writes into a time ordered log), and
> we
> > > now
> > > > > also
> > > > > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > > > > multi-table
> > > > > > > and
> > > > > > > > fully non-blocking writers soon (see future work section of
> > > RFC-22)
> > > > > > > > d) *updates/deletes* : this is the bread-and-butter use-case
> > for
> > > > > Hudi,
> > > > > > > but
> > > > > > > > we support primary/unique key constraints and we could add
> > > foreign
> > > > > keys
> > > > > > > as
> > > > > > > > an extension, once our transactions can span tables.
> > > > > > > > e) *table services*: a hudi pipeline today is self-managing -
> > > sizes
> > > > > > > files,
> > > > > > > > cleans, compacts, clusters data, bootstraps existing data -
> all
> > > these
> > > > > > > > actions working off each other without blocking one another.
> > (for
> > > > > most
> > > > > > > > parts).
> > > > > > > > f) *data services*: we also have higher level functionality
> > with
> > > > > > > > deltastreamer sources (scalable DFS listing source, Kafka,
> > > Pulsar is
> > > > > > > > coming, ...and more), incremental ETL support,
> de-duplication,
> > > commit
> > > > > > > > callbacks, pre-commit validations are coming, error tables
> have
> > > been
> > > > > > > > proposed. I could also envision us building towards streaming
> > > egress,
> > > > > > > data
> > > > > > > > monitoring.
> > > > > > > >
> > > > > > > > I also think we should build the following (subject to
> separate
> > > > > > > > DISCUSS/RFCs)
> > > > > > > >
> > > > > > > > g) *caching service*: Hudi specific caching service that can
> > hold
> > > > > > mutable
> > > > > > > > data and serve oft-queried data across engines.
> > > > > > > > h) *timeline metaserver:* We already run a metaserver in
> spark
> > > > > > > > writer/drivers, backed by rocksDB & even Hudi's metadata
> table.
> > > Let's
> > > > > > > turn
> > > > > > > > it into a scalable, sharded metastore, that all engines can
> use
> > > to
> > > > > > obtain
> > > > > > > > any metadata.
> > > > > > > >
> > > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*"
> as
> > > > > opposed
> > 

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Susu Dong
I love this rebranding. Totally agree. +1

On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu 
wrote:

> +1 The vision looks fantastic.
>
> On Tue, Apr 13, 2021 at 7:45 AM Gary Li  wrote:
>
> > Awesome summary of Hudi! +1 as well.
> >
> > Gary Li
> > On 2021/04/13 14:13:24, Rubens Rodrigues 
> > wrote:
> > > Excellent, I agree
> > >
> > > On Tue, Apr 13, 2021 at 7:23 AM vino yang
> > wrote:
> > >
> > > > +1 Excited by this new vision!
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang
> wrote:
> > > >
> > > > > +1  The new brand is straightforward, a better description of Hudi.
> > > > >
> > > > > Best,
> > > > > Dianjin Wang
> > > > >
> > > > >
> > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > bhavanisud...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1 . Cannot agree more. I think this makes total sense and will
> > provide
> > > > > for
> > > > > > a much better representation of the project.
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
> vin...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > Reading one more article today, positioning Hudi, as just a
> table
> > > > > format,
> > > > > > > made me wonder, if we have done enough justice in explaining
> > what we
> > > > > have
> > > > > > > built together here.
> > > > > > > I tend to think of Hudi as the data lake platform, which has
> the
> > > > > > following
> > > > > > > components, of which - one is a table format, one is a
> > transactional
> > > > > > > storage layer.
> > > > > > > But the whole stack we have is definitely worth more than the
> > sum of
> > > > > all
> > > > > > > the parts IMO (speaking from my own experience from the past
> 10+
> > > > years
> > > > > of
> > > > > > > open source software dev).
> > > > > > >
> > > > > > > Here's what we have built so far.
> > > > > > >
> > > > > > > a) *table format* : something that stores table schema, a
> > metadata
> > > > > table
> > > > > > > that stores file listing today, and being extended to store
> > column
> > > > > ranges
> > > > > > > and more in the future (RFC-27)
> > > > > > > b) *aux metadata* : bloom filters, external record level
> indexes
> > > > today,
> > > > > > > bitmaps/interval trees and other advanced on-disk data
> structures
> > > > > > tomorrow
> > > > > > > c) *concurrency control* : we always supported MVCC based log
> > based
> > > > > > > concurrency (serialize writes into a time ordered log), and we
> > now
> > > > also
> > > > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > > > multi-table
> > > > > > and
> > > > > > > fully non-blocking writers soon (see future work section of
> > RFC-22)
> > > > > > > d) *updates/deletes* : this is the bread-and-butter use-case
> for
> > > > Hudi,
> > > > > > but
> > > > > > > we support primary/unique key constraints and we could add
> > foreign
> > > > keys
> > > > > > as
> > > > > > > an extension, once our transactions can span tables.
> > > > > > > e) *table services*: a hudi pipeline today is self-managing -
> > sizes
> > > > > > files,
> > > > > > > cleans, compacts, clusters data, bootstraps existing data - all
> > these
> > > > > > > actions working off each other without blocking one another.
> (for
> > > > most
> > > > > > > parts).
> > > > > > > f) *data services*: we also have higher level functionality
> with
> > > > > > > deltastreamer sources (scalable DFS listing source, Kafka,
> > Pulsar is
> > > > > > > coming, ...and more), incremental ETL support, de-duplication,
> > commit
> > > > > > > callbacks, pre-commit validations are coming, error tables have
> > been
> > > > > > > proposed. I could also envision us building towards streaming
> > egress,
> > > > > > data
> > > > > > > monitoring.
> > > > > > >
> > > > > > > I also think we should build the following (subject to separate
> > > > > > > DISCUSS/RFCs)
> > > > > > >
> > > > > > > g) *caching service*: Hudi specific caching service that can
> hold
> > > > > mutable
> > > > > > > data and serve oft-queried data across engines.
> > > > > > > h) *timeline metaserver:* We already run a metaserver in spark
> > > > > > > writer/drivers, backed by rocksDB & even Hudi's metadata table.
> > Let's
> > > > > > turn
> > > > > > > it into a scalable, sharded metastore, that all engines can use
> > to
> > > > > obtain
> > > > > > > any metadata.
> > > > > > >
> > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > > > opposed
> > > > > to
> > > > > > > "ingests & manages storage of large analytical datasets over
> DFS
> > > > (hdfs
> > > > > or
> > > > > > > cloud stores)." and convey the scope of our vision,
> > > > > > > given we have already been building towards that. It would also
> > > > provide
> > > > > > new
> > > > > > > contributors a good lens to look at the project from.
> > > > > > >
> > > > > > > (This is very similar to for e.g, the evolution of Kafka from a
> > > > pub-sub
> >

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Raymond Xu
+1 The vision looks fantastic.

On Tue, Apr 13, 2021 at 7:45 AM Gary Li  wrote:

> Awesome summary of Hudi! +1 as well.
>
> Gary Li
> On 2021/04/13 14:13:24, Rubens Rodrigues 
> wrote:
> > Excellent, I agree
> >
> > On Tue, Apr 13, 2021 at 7:23 AM vino yang
> wrote:
> >
> > > +1 Excited by this new vision!
> > >
> > > Best,
> > > Vino
> > >
> > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang  wrote:
> > >
> > > > +1  The new brand is straightforward, a better description of Hudi.
> > > >
> > > > Best,
> > > > Dianjin Wang
> > > >
> > > >
> > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> bhavanisud...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1 . Cannot agree more. I think this makes total sense and will
> provide
> > > > for
> > > > > a much better representation of the project.
> > > > >
> > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar
> > > > wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > Reading one more article today, positioning Hudi, as just a table
> > > > format,
> > > > > > made me wonder, if we have done enough justice in explaining
> what we
> > > > have
> > > > > > built together here.
> > > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > > following
> > > > > > components, of which - one is a table format, one is a
> transactional
> > > > > > storage layer.
> > > > > > But the whole stack we have is definitely worth more than the
> sum of
> > > > all
> > > > > > the parts IMO (speaking from my own experience from the past 10+
> > > years
> > > > of
> > > > > > open source software dev).
> > > > > >
> > > > > > Here's what we have built so far.
> > > > > >
> > > > > > a) *table format* : something that stores table schema, a
> metadata
> > > > table
> > > > > > that stores file listing today, and being extended to store
> column
> > > > ranges
> > > > > > and more in the future (RFC-27)
> > > > > > b) *aux metadata* : bloom filters, external record level indexes
> > > today,
> > > > > > bitmaps/interval trees and other advanced on-disk data structures
> > > > > tomorrow
> > > > > > c) *concurrency control* : we always supported MVCC based log
> based
> > > > > > concurrency (serialize writes into a time ordered log), and we
> now
> > > also
> > > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > > multi-table
> > > > > and
> > > > > > fully non-blocking writers soon (see future work section of
> RFC-22)
> > > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> > > Hudi,
> > > > > but
> > > > > > we support primary/unique key constraints and we could add
> foreign
> > > keys
> > > > > as
> > > > > > an extension, once our transactions can span tables.
> > > > > > e) *table services*: a hudi pipeline today is self-managing -
> sizes
> > > > > files,
> > > > > > cleans, compacts, clusters data, bootstraps existing data - all
> these
> > > > > > actions working off each other without blocking one another. (for
> > > most
> > > > > > parts).
> > > > > > f) *data services*: we also have higher level functionality with
> > > > > > deltastreamer sources (scalable DFS listing source, Kafka,
> Pulsar is
> > > > > > coming, ...and more), incremental ETL support, de-duplication,
> commit
> > > > > > callbacks, pre-commit validations are coming, error tables have
> been
> > > > > > proposed. I could also envision us building towards streaming
> egress,
> > > > > data
> > > > > > monitoring.
> > > > > >
> > > > > > I also think we should build the following (subject to separate
> > > > > > DISCUSS/RFCs)
> > > > > >
> > > > > > g) *caching service*: Hudi specific caching service that can hold
> > > > mutable
> > > > > > data and serve oft-queried data across engines.
> > > > > > h) *timeline metaserver:* We already run a metaserver in spark
> > > > > > writer/drivers, backed by rocksDB & even Hudi's metadata table.
> Let's
> > > > > turn
> > > > > > it into a scalable, sharded metastore, that all engines can use
> to
> > > > obtain
> > > > > > any metadata.
> > > > > >
> > > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > > opposed
> > > > to
> > > > > > "ingests & manages storage of large analytical datasets over DFS
> > > (hdfs
> > > > or
> > > > > > cloud stores)." and convey the scope of our vision,
> > > > > > given we have already been building towards that. It would also
> > > provide
> > > > > new
> > > > > > contributors a good lens to look at the project from.
> > > > > >
> > > > > > (This is very similar to for e.g, the evolution of Kafka from a
> > > pub-sub
> > > > > > system, to an event streaming platform - with addition of
> > > > > > MirrorMaker/Connect etc. )
> > > > > >
> > > > > > Please share your thoughts!
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread vbal...@apache.org
 ++1. The rewording makes total sense
Balaji.V
On Tuesday, April 13, 2021, 07:45:16 AM PDT, Gary Li  wrote:
 
 Awesome summary of Hudi! +1 as well. 

Gary Li
On 2021/04/13 14:13:24, Rubens Rodrigues  wrote: 
> Excellent, I agree
> 
> On Tue, Apr 13, 2021 at 7:23 AM vino yang  wrote:
> 
> > +1 Excited by this new vision!
> >
> > Best,
> > Vino
> >
> > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang  wrote:
> >
> > > +1  The new brand is straightforward, a better description of Hudi.
> > >
> > > Best,
> > > Dianjin Wang
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha 
> > > wrote:
> > >
> > > > +1 . Cannot agree more. I think this makes total sense and will provide
> > > for
> > > > a much better representation of the project.
> > > >
> > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Reading one more article today, positioning Hudi, as just a table
> > > format,
> > > > > made me wonder, if we have done enough justice in explaining what we
> > > have
> > > > > built together here.
> > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > following
> > > > > components, of which - one is a table format, one is a transactional
> > > > > storage layer.
> > > > > But the whole stack we have is definitely worth more than the sum of
> > > all
> > > > > the parts IMO (speaking from my own experience from the past 10+
> > years
> > > of
> > > > > open source software dev).
> > > > >
> > > > > Here's what we have built so far.
> > > > >
> > > > > a) *table format* : something that stores table schema, a metadata
> > > table
> > > > > that stores file listing today, and being extended to store column
> > > ranges
> > > > > and more in the future (RFC-27)
> > > > > b) *aux metadata* : bloom filters, external record level indexes
> > today,
> > > > > bitmaps/interval trees and other advanced on-disk data structures
> > > > tomorrow
> > > > > c) *concurrency control* : we always supported MVCC based log based
> > > > > concurrency (serialize writes into a time ordered log), and we now
> > also
> > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > multi-table
> > > > and
> > > > > fully non-blocking writers soon (see future work section of RFC-22)
> > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> > Hudi,
> > > > but
> > > > > we support primary/unique key constraints and we could add foreign
> > keys
> > > > as
> > > > > an extension, once our transactions can span tables.
> > > > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > > > files,
> > > > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > > > actions working off each other without blocking one another. (for
> > most
> > > > > parts).
> > > > > f) *data services*: we also have higher level functionality with
> > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > > callbacks, pre-commit validations are coming, error tables have been
> > > > > proposed. I could also envision us building towards streaming egress,
> > > > data
> > > > > monitoring.
> > > > >
> > > > > I also think we should build the following (subject to separate
> > > > > DISCUSS/RFCs)
> > > > >
> > > > > g) *caching service*: Hudi specific caching service that can hold
> > > mutable
> > > > > data and serve oft-queried data across engines.
> > > > > h) *timeline metaserver:* We already run a metaserver in spark
> > > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > > > turn
> > > > > it into a scalable, sharded metastore, that all engines can use to
> > > obtain
> > > > > any metadata.
> > > > >
> > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > opposed
> > > to
> > > > > "ingests & manages storage of large analytical datasets over DFS
> > (hdfs
> > > or
> > > > > cloud stores)." and convey the scope of our vision,
> > > > > given we have already been building towards that. It would also
> > provide
> > > > new
> > > > > contributors a good lens to look at the project from.
> > > > >
> > > > > (This is very similar to for e.g, the evolution of Kafka from a
> > pub-sub
> > > > > system, to an event streaming platform - with addition of
> > > > > MirrorMaker/Connect etc. )
> > > > >
> > > > > Please share your thoughts!
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
> 
  

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Gary Li
Awesome summary of Hudi! +1 as well. 

Gary Li
On 2021/04/13 14:13:24, Rubens Rodrigues  wrote: 
> Excellent, I agree
> 
> On Tue, Apr 13, 2021 at 7:23 AM vino yang  wrote:
> 
> > +1 Excited by this new vision!
> >
> > Best,
> > Vino
> >
> > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang  wrote:
> >
> > > +1  The new brand is straightforward, a better description of Hudi.
> > >
> > > Best,
> > > Dianjin Wang
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha 
> > > wrote:
> > >
> > > > +1 . Cannot agree more. I think this makes total sense and will provide
> > > for
> > > > a much better representation of the project.
> > > >
> > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Reading one more article today, positioning Hudi, as just a table
> > > format,
> > > > > made me wonder, if we have done enough justice in explaining what we
> > > have
> > > > > built together here.
> > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > following
> > > > > components, of which - one is a table format, one is a transactional
> > > > > storage layer.
> > > > > But the whole stack we have is definitely worth more than the sum of
> > > all
> > > > > the parts IMO (speaking from my own experience from the past 10+
> > years
> > > of
> > > > > open source software dev).
> > > > >
> > > > > Here's what we have built so far.
> > > > >
> > > > > a) *table format* : something that stores table schema, a metadata
> > > table
> > > > > that stores file listing today, and being extended to store column
> > > ranges
> > > > > and more in the future (RFC-27)
> > > > > b) *aux metadata* : bloom filters, external record level indexes
> > today,
> > > > > bitmaps/interval trees and other advanced on-disk data structures
> > > > tomorrow
> > > > > c) *concurrency control* : we always supported MVCC based log based
> > > > > concurrency (serialize writes into a time ordered log), and we now
> > also
> > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > multi-table
> > > > and
> > > > > fully non-blocking writers soon (see future work section of RFC-22)
> > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> > Hudi,
> > > > but
> > > > > we support primary/unique key constraints and we could add foreign
> > keys
> > > > as
> > > > > an extension, once our transactions can span tables.
> > > > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > > > files,
> > > > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > > > actions working off each other without blocking one another. (for
> > most
> > > > > parts).
> > > > > f) *data services*: we also have higher level functionality with
> > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > > callbacks, pre-commit validations are coming, error tables have been
> > > > > proposed. I could also envision us building towards streaming egress,
> > > > data
> > > > > monitoring.
> > > > >
> > > > > I also think we should build the following (subject to separate
> > > > > DISCUSS/RFCs)
> > > > >
> > > > > g) *caching service*: Hudi specific caching service that can hold
> > > mutable
> > > > > data and serve oft-queried data across engines.
> > > > > h) *timeline metaserver:* We already run a metaserver in spark
> > > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > > > turn
> > > > > it into a scalable, sharded metastore, that all engines can use to
> > > obtain
> > > > > any metadata.
> > > > >
> > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > opposed
> > > to
> > > > > "ingests & manages storage of large analytical datasets over DFS
> > (hdfs
> > > or
> > > > > cloud stores)." and convey the scope of our vision,
> > > > > given we have already been building towards that. It would also
> > provide
> > > > new
> > > > > contributors a good lens to look at the project from.
> > > > >
> > > > > (This is very similar to for e.g, the evolution of Kafka from a
> > pub-sub
> > > > > system, to an event streaming platform - with addition of
> > > > > MirrorMaker/Connect etc. )
> > > > >
> > > > > Please share your thoughts!
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
> 


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Rubens Rodrigues
Excellent, I agree

On Tue, Apr 13, 2021 at 7:23 AM vino yang  wrote:

> +1 Excited by this new vision!
>
> Best,
> Vino
>
> On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang  wrote:
>
> > +1  The new brand is straightforward, a better description of Hudi.
> >
> > Best,
> > Dianjin Wang
> >
> >
> > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha 
> > wrote:
> >
> > > +1 . Cannot agree more. I think this makes total sense and will provide
> > for
> > > a much better representation of the project.
> > >
> > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hello all,
> > > >
> > > > Reading one more article today, positioning Hudi, as just a table
> > format,
> > > > made me wonder, if we have done enough justice in explaining what we
> > have
> > > > built together here.
> > > > I tend to think of Hudi as the data lake platform, which has the
> > > following
> > > > components, of which - one is a table format, one is a transactional
> > > > storage layer.
> > > > But the whole stack we have is definitely worth more than the sum of
> > all
> > > > the parts IMO (speaking from my own experience from the past 10+
> years
> > of
> > > > open source software dev).
> > > >
> > > > Here's what we have built so far.
> > > >
> > > > a) *table format* : something that stores table schema, a metadata
> > table
> > > > that stores file listing today, and being extended to store column
> > ranges
> > > > and more in the future (RFC-27)
> > > > b) *aux metadata* : bloom filters, external record level indexes
> today,
> > > > bitmaps/interval trees and other advanced on-disk data structures
> > > tomorrow
> > > > c) *concurrency control* : we always supported MVCC based log based
> > > > concurrency (serialize writes into a time ordered log), and we now
> also
> > > > have OCC for batch merge workloads with 0.8.0. We will have
> multi-table
> > > and
> > > > fully non-blocking writers soon (see future work section of RFC-22)
> > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> Hudi,
> > > but
> > > > we support primary/unique key constraints and we could add foreign
> keys
> > > as
> > > > an extension, once our transactions can span tables.
> > > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > > files,
> > > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > > actions working off each other without blocking one another. (for
> most
> > > > parts).
> > > > f) *data services*: we also have higher level functionality with
> > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > callbacks, pre-commit validations are coming, error tables have been
> > > > proposed. I could also envision us building towards streaming egress,
> > > data
> > > > monitoring.
> > > >
> > > > I also think we should build the following (subject to separate
> > > > DISCUSS/RFCs)
> > > >
> > > > g) *caching service*: Hudi specific caching service that can hold
> > mutable
> > > > data and serve oft-queried data across engines.
> > > > h) t*imeline metaserver:* We already run a metaserver in spark
> > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > > turn
> > > > it into a scalable, sharded metastore, that all engines can use to
> > obtain
> > > > any metadata.
> > > >
> > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> opposed
> > to
> > > > "ingests & manages storage of large analytical datasets over DFS
> (hdfs
> > or
> > > > cloud stores)." and convey the scope of our vision,
> > > > given we have already been building towards that. It would also
> provide
> > > new
> > > > contributors a good lens to look at the project from.
> > > >
> > > > (This is very similar to for e.g, the evolution of Kafka from a
> pub-sub
> > > > system, to an event streaming platform - with addition of
> > > > MirrorMaker/Connect etc. )
> > > >
> > > > Please share your thoughts!
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread vino yang
+1 Excited by this new vision!

Best,
Vino

On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:

> +1  The new brand is straightforward, a better description of Hudi.
>
> Best,
> Dianjin Wang


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Dianjin Wang
+1  The new brand is straightforward, a better description of Hudi.

Best,
Dianjin Wang


On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha wrote:

> +1 . Cannot agree more. I think this makes total sense and will provide for
> a much better representation of the project.


Re: [DISCUSS] Hudi is the data lake platform

2021-04-12 Thread Bhavani Sudha
+1 . Cannot agree more. I think this makes total sense and will provide for
a much better representation of the project.



[DISCUSS] Hudi is the data lake platform

2021-04-12 Thread Vinoth Chandar
Hello all,

Reading one more article today that positions Hudi as just a table format
made me wonder whether we have done enough justice in explaining what we have
built together here.
I tend to think of Hudi as the data lake platform, which has the following
components, of which one is a table format and one is a transactional
storage layer.
But the whole stack we have is definitely worth more than the sum of all
the parts IMO (speaking from my own experience over the past 10+ years of
open source software dev).

Here's what we have built so far.

a) *table format* : something that stores the table schema, plus a metadata
table that stores file listings today and is being extended to store column
ranges and more in the future (RFC-27)
b) *aux metadata* : bloom filters, external record level indexes today,
bitmaps/interval trees and other advanced on-disk data structures tomorrow
c) *concurrency control* : we have always supported MVCC-based, log-based
concurrency (serialize writes into a time-ordered log), and we now also
have OCC for batch merge workloads with 0.8.0. We will have multi-table and
fully non-blocking writers soon (see the future work section of RFC-22)
d) *updates/deletes* : this is the bread-and-butter use-case for Hudi, but
we also support primary/unique key constraints, and we could add foreign
keys as an extension once our transactions can span tables.
e) *table services*: a Hudi pipeline today is self-managing: it sizes files,
cleans, compacts, clusters data and bootstraps existing data, with all these
actions working off each other without blocking one another (for the most
part).
f) *data services*: we also have higher-level functionality with
deltastreamer sources (scalable DFS listing source, Kafka, with Pulsar
coming, and more), incremental ETL support, de-duplication and commit
callbacks; pre-commit validations are coming, and error tables have been
proposed. I could also envision us building towards streaming egress and
data monitoring.
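
To make item (c) a little more concrete, here is a toy sketch of optimistic
concurrency control over a time-ordered commit log. It is only an
illustration of the general idea, not Hudi's implementation: the `Timeline`
class, its `begin`/`try_commit` methods and the file names are all
hypothetical, and conflict detection is simplified to "two writers touched
the same file".

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    """Toy time-ordered commit log (hypothetical, not a Hudi class)."""
    commits: list = field(default_factory=list)  # entries: (instant, files_touched)
    _clock: int = 0                              # instant of the latest commit

    def begin(self) -> int:
        """Return the latest committed instant the writer starts from."""
        return self._clock

    def try_commit(self, start_instant: int, files: set) -> bool:
        """Optimistic check: reject if a later commit touched the same files."""
        for instant, touched in self.commits:
            if instant > start_instant and touched & files:
                return False  # overlapping concurrent write; caller should retry
        self._clock += 1
        self.commits.append((self._clock, set(files)))
        return True


timeline = Timeline()

# Two writers start from the same instant but touch different files.
w1, w2 = timeline.begin(), timeline.begin()
ok_a = timeline.try_commit(w1, {"file-a"})
ok_b = timeline.try_commit(w2, {"file-b"})

# Two writers start from the same instant and touch the same file.
w3, w4 = timeline.begin(), timeline.begin()
ok_first = timeline.try_commit(w3, {"file-a"})
ok_second = timeline.try_commit(w4, {"file-a"})

print(ok_a, ok_b, ok_first, ok_second)
```

Writers with disjoint file sets both succeed, while the second of two
overlapping writers is rejected and would have to retry against the newer
state of the log.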

I also think we should build the following (subject to separate
DISCUSS/RFCs)

g) *caching service*: a Hudi-specific caching service that can hold mutable
data and serve oft-queried data across engines.
h) *timeline metaserver*: we already run a metaserver in Spark
writers/drivers, backed by RocksDB and even Hudi's metadata table. Let's turn
it into a scalable, sharded metastore that all engines can use to obtain
any metadata.

To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to
"ingests & manages storage of large analytical datasets over DFS (hdfs or
cloud stores)." and convey the scope of our vision, given we have already
been building towards that. It would also give new contributors a good lens
through which to look at the project.

(This is very similar to, e.g., the evolution of Kafka from a pub-sub
system to an event streaming platform with the addition of
MirrorMaker/Connect etc.)

Please share your thoughts!

Thanks
Vinoth