Re: [DISCUSS] Hudi is the data lake platform
Folks, I have been digesting some feedback on what we show on the home page itself. While the blog explains the vision, it would be good to bubble up the sub-areas most relevant to our users today: transactions, updates, deletes. So I have raised a PR moving things around. We now lead with "Hudi brings transactions, record-level updates/deletes and change streams to data lakes", then explain the platform at the next level of detail. https://github.com/apache/hudi/pull/3406

On Mon, Aug 2, 2021 at 9:39 AM Vinoth Chandar wrote:
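For readers new to the thread, the record-level update/delete semantics the tagline leads with can be sketched with a toy key-value model. This is an illustration only, not Hudi's API: real Hudi tables apply changes at file-group granularity under a transactional timeline.

```python
# Toy model of record-level upserts/deletes keyed by record key.
# Illustration only -- not how Hudi physically stores or merges data.

def apply_changes(table: dict, changes: list) -> dict:
    """Apply a batch of change records to a keyed table."""
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            table.pop(key, None)          # record-level delete
        else:
            table[key] = change["value"]  # insert or update (upsert)
    return table

table = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}
table = apply_changes(table, [
    {"key": "u2", "op": "upsert", "value": {"name": "bobby"}},
    {"key": "u3", "op": "upsert", "value": {"name": "carol"}},
    {"key": "u1", "op": "delete"},
])
print(sorted(table))  # -> ['u2', 'u3']
```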
Re: [DISCUSS] Hudi is the data lake platform
Thanks! Will work on it this week. Also redoing some images based on feedback.

On Fri, Jul 30, 2021 at 2:06 AM vino yang wrote:
Re: [DISCUSS] Hudi is the data lake platform
+1

Pratyaksh Sharma wrote on Fri, Jul 30, 2021 at 1:47 AM:
Re: [DISCUSS] Hudi is the data lake platform
I guess we should rebrand Hudi in the README.md file as well - https://github.com/apache/hudi#readme? That page still mentions the following:

"Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."

On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar wrote:
Re: [DISCUSS] Hudi is the data lake platform
Thanks Vino! Got a bunch of emoticons on the PR as well.

Will land this Monday, giving it more time over the weekend as well.

On Wed, Jul 21, 2021 at 7:36 PM vino yang wrote:
Re: [DISCUSS] Hudi is the data lake platform
Thanks vc,

Very good blog - in-depth and forward-looking. Learned a lot!

Best,
Vino

Vinoth Chandar wrote on Thu, Jul 22, 2021 at 3:58 AM:
Re: [DISCUSS] Hudi is the data lake platform
Expanding to users@ as well. Hi all, Since this discussion, I started to pen down a coherent strategy and convey these ideas via a blog post. I have also done my own research, talked to (ex)colleagues I respect to get their take and refine it. Here's a blog that hopefully explains this vision. https://github.com/apache/hudi/pull/3322 Look forward to your feedback on the PR. We are hoping to land this early next week, if everyone is aligned. Thanks Vinoth On Wed, Apr 21, 2021 at 9:01 PM wei li wrote: > +1 , Cannot agree more. > *aux metadata* and metatable, can make hudi have large preformance > optimization on query end. > Can continuous develop. > cache service may the necessary component in cloud native environment. > > On 2021/04/13 05:29:55, Vinoth Chandar wrote: > > Hello all, > > > > Reading one more article today, positioning Hudi, as just a table format, > > made me wonder, if we have done enough justice in explaining what we have > > built together here. > > I tend to think of Hudi as the data lake platform, which has the > following > > components, of which - one if a table format, one is a transactional > > storage layer. > > But the whole stack we have is definitely worth more than the sum of all > > the parts IMO (speaking from my own experience from the past 10+ years of > > open source software dev). > > > > Here's what we have built so far. > > > > a) *table format* : something that stores table schema, a metadata table > > that stores file listing today, and being extended to store column ranges > > and more in the future (RFC-27) > > b) *aux metadata* : bloom filters, external record level indexes today, > > bitmaps/interval trees and other advanced on-disk data structures > tomorrow > > c) *concurrency control* : we always supported MVCC based log based > > concurrency (serialize writes into a time ordered log), and we now also > > have OCC for batch merge workloads with 0.8.0. 
We will have multi-table > and > > fully non-blocking writers soon (see future work section of RFC-22) > > d) *updates/deletes* : this is the bread-and-butter use-case for Hudi, > but > > we support primary/unique key constraints and we could add foreign keys > as > > an extension, once our transactions can span tables. > > e) *table services*: a hudi pipeline today is self-managing - sizes > files, > > cleans, compacts, clusters data, bootstraps existing data - all these > > actions working off each other without blocking one another. (for most > > parts). > > f) *data services*: we also have higher level functionality with > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is > > coming, ...and more), incremental ETL support, de-duplication, commit > > callbacks, pre-commit validations are coming, error tables have been > > proposed. I could also envision us building towards streaming egress, > data > > monitoring. > > > > I also think we should build the following (subject to separate > > DISCUSS/RFCs) > > > > g) *caching service*: Hudi specific caching service that can hold mutable > > data and serve oft-queried data across engines. > > h) t*imeline metaserver:* We already run a metaserver in spark > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's > turn > > it into a scalable, sharded metastore, that all engines can use to obtain > > any metadata. > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to > > "ingests & manages storage of large analytical datasets over DFS (hdfs or > > cloud stores)." and convey the scope of our vision, > > given we have already been building towards that. It would also provide > new > > contributors a good lens to look at the project from. > > > > (This is very similar to for e.g, the evolution of Kafka from a pub-sub > > system, to an event streaming platform - with addition of > > MirrorMaker/Connect etc. ) > > > > Please share your thoughts! 
> > > > Thanks > > Vinoth > > >
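To make the concurrency-control point in (c) above concrete, here is a minimal, purely illustrative Python sketch - toy code, not Hudi's actual API or implementation - of the two ideas: writes serialized onto a time-ordered timeline of instants (the MVCC-style log), and an optimistic check (OCC) that rejects a commit when a concurrent writer already touched the same file groups.

```python
# Toy model of a Hudi-style timeline; illustrative only, not Hudi's API.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Instant:
    ts: str          # commit time, lexicographically ordered (e.g. "20210413052955")
    files: Set[str]  # file groups touched by this commit

@dataclass
class Timeline:
    instants: List[Instant] = field(default_factory=list)

    def commit(self, instant: Instant) -> None:
        # MVCC-style log: writes are serialized in commit-time order.
        self.instants.append(instant)

    def try_commit_occ(self, instant: Instant, seen_upto: str) -> bool:
        # OCC: reject if any instant committed after `seen_upto`
        # overlaps the file groups this writer modified.
        for other in self.instants:
            if other.ts > seen_upto and other.files & instant.files:
                return False  # conflict: concurrent writer touched same files
        self.commit(instant)
        return True

tl = Timeline()
tl.commit(Instant("001", {"fg1"}))
tl.commit(Instant("002", {"fg2"}))

# A writer that last saw the table at "001" and touches fg2 conflicts with "002":
assert tl.try_commit_occ(Instant("003", {"fg2"}), seen_upto="001") is False
# A non-overlapping writer succeeds:
assert tl.try_commit_occ(Instant("003", {"fg3"}), seen_upto="001") is True
```

In real Hudi the conflict resolution strategy, lock provider, and instant format are all richer than this; the sketch only shows why log-serialized writes and file-level overlap checks can coexist on one timeline.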
Re: [DISCUSS] Hudi is the data lake platform
+1, cannot agree more. *aux metadata* and the metadata table can give Hudi large performance optimizations on the query side, and can keep being developed. A cache service may be a necessary component in a cloud-native environment.
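The query-side win wei li points at comes from data skipping: with column ranges (RFC-27) or bloom filters kept as aux metadata, a reader can prune files before opening them. A toy min/max pruning sketch (illustrative only; not Hudi's metadata table layout):

```python
# Illustrative data skipping with per-file column ranges; not Hudi's API.
files = {
    "f1.parquet": {"uuid_min": "a", "uuid_max": "f"},
    "f2.parquet": {"uuid_min": "g", "uuid_max": "m"},
    "f3.parquet": {"uuid_min": "n", "uuid_max": "z"},
}

def prune(key: str) -> list:
    # Keep only files whose [min, max] range could contain the key;
    # the others are skipped without any I/O against the data files.
    return [f for f, r in files.items() if r["uuid_min"] <= key <= r["uuid_max"]]

assert prune("h") == ["f2.parquet"]  # two of three files skipped
```

The same shape of check works for bloom filters ("definitely not in this file") and is why richer on-disk index structures keep paying off at query time.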
Re: [DISCUSS] Hudi is the data lake platform
Looks like we have consensus here! Will share the blog PR here once ready. Thanks all! On Fri, Apr 16, 2021 at 8:43 PM Sivabalan wrote: > totally +1 on clarifying Hudi's vision. > > On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal wrote: > > +1 > > > > I also believe Hudi is a Data Platform technology providing many different > > functionalities to build modern data lakes, Hudi's table format being just > > one of them. I've been using this perspective in some of the conference > > talks already ;) > > With this rebranding (and hopefully some code/package structuring down the > > road..), it's easier for us to communicate the value add of Hudi and its > > associated features and generate interest for future contributors. > > > > Thanks, > > Nishith > > > > On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar wrote: > > > Thanks everyone for the feedback, so far! > > > > > > On the incremental aspects, that's actually Hudi's core design > > > differentiation. While I believe ETL today is still largely batch > > > oriented, the way forward for everyone's benefit is indeed incremental > > > processing. We have already taken a giant step here, e.g., in making raw > > > data ingestion fully incremental using deltastreamer. We should keep > > > working to crack incremental ETL at large. 100% with your line of thinking! > > > > > > It's been in my head for four full years now! :) > > > > > > https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ > > > > > > I have started drafting a blog/PR along these lines already. I will make it > > > more final and share here, as we wait a couple more days for more feedback! > > > > > > Thanks > > > Vinoth > > > > > > On Tue, Apr 13, 2021 at 7:01 PM Danny Chan wrote: > > > > +1 for the vision; personally I find the incremental ETL part promising - > > > > with an engine like Apache Flink we can do intermediate aggregation in > > > > streaming style.
> > > > Best, > > > > Danny Chan > > > > > leesf wrote on Wed, Apr 14, 2021 at 9:52 AM: > > > > > +1. Cool and promising. > > > > > > Mehrotra, Udit wrote on Wed, Apr 14, 2021 at 2:57 AM: > > > > > > Agree with the rebranding, Vinoth. Hudi is not just a "table format" and we > > > > > > need to do justice to all the cool auxiliary features/services we have > > > > > > built. > > > > > > Also, a timeline metadata service in particular would be a really big win if > > > > > > we move towards something like that. > > > > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" wrote: > > > > > > Definitely we are doing much more than only ingesting and managing data > > > > > > over DFS. > > > > > > +1 from my side as well. :) > > > > > > On Tue, Apr 13, 2021 at 10:02 PM Susu Dong < susudo...@gmail.com> wrote: > > > > > > > I love this rebranding. Totally agree. +1 > > > > > > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu < xu.shiyan.raym...@gmail.com> wrote: > > > > > > > > +1 The vision looks fantastic. > > > > > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li < gar...@apache.org > wrote: > > > > > > > > > Awesome summary of Hudi! +1 as well.
> > > > > > > > > Gary Li > > > > > > > > > On 2021/04/13 14:13:24, Rubens Rodrigues < rubenssoto2...@gmail.com> wrote: > > > > > > > > > > Excellent, I agree > > > > > > > > > > On Tue, Apr 13, 2021 at 07:23, vino yang < yanghua1...@gmail.com> wrote: > > > > > > > > > > > +1 Excited by this new vision! > > > > > > > > > > > Best, > > > > > > > > > > > Vino > > > > > > > > > > > Dianjin Wang wrote on Tue, Apr 13, 2021 at 3:53 PM: > > > > > > > > > > > > +1 The new brand is straightforward, a better description of Hudi. > > > > > > > > > > > > Best, > > > > > > > > > > > > Dianjin Wang > > > > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha < bhavanisud...@gmail.com> wrote: > > > > > > > > > > > > > +1. Cannot agree more. I think this makes total sense and will provide for a much better representation of the project.
Re: [DISCUSS] Hudi is the data lake platform
totally +1 on clarifying Hudi's vision.
Re: [DISCUSS] Hudi is the data lake platform
+1 I also believe Hudi is a Data Platform technology providing many different functionalities to build modern data lakes, Hudi's table format being just one of them. I've been using this perspective in some of the conference talks already ;) With this rebranding (and hopefully some code/package structuring down the road..), it's easier for us to communicate the value add of Hudi and its associated features and generate interest for future contributors. Thanks, Nishith
Re: [DISCUSS] Hudi is the data lake platform
Thanks everyone for the feedback, so far! On the incremental aspects, that's actually Hudi's core design differentiation. While I believe ETL today is still largely batch oriented, the way forward for everyone's benefit is indeed incremental processing. We have already taken a giant step here, e.g., in making raw data ingestion fully incremental using deltastreamer. We should keep working to crack incremental ETL at large. 100% with your line of thinking! It's been in my head for four full years now! :) https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ I have started drafting a blog/PR along these lines already. I will make it more final and share here, as we wait a couple more days for more feedback! Thanks Vinoth
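The incremental-processing argument above can be made concrete with a toy example: instead of recomputing an aggregate over the whole table, a downstream job consumes only the change records committed since its last checkpoint. In practice this would be a Hudi incremental query feeding an ETL job; the code below is a self-contained, purely illustrative sketch, not Hudi code.

```python
# Toy incremental ETL: maintain a per-key sum from change records only.
from collections import defaultdict

state = defaultdict(int)  # the downstream aggregate "table"
checkpoint = 0            # last consumed commit time

def apply_increment(changes, upto):
    """Consume only records committed after the checkpoint, not the full table."""
    global checkpoint
    for ts, key, delta in changes:
        if checkpoint < ts <= upto:
            state[key] += delta
    checkpoint = upto

log = [(1, "a", 5), (2, "b", 3), (3, "a", -2)]
apply_increment(log, upto=2)
assert dict(state) == {"a": 5, "b": 3}
apply_increment(log, upto=3)  # only the new commit (ts=3) is applied
assert dict(state) == {"a": 3, "b": 3}
```

The second call does work proportional to the new changes, not to the table size; that is the cost model incremental processing buys over periodic full recomputation.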
Re: [DISCUSS] Hudi is the data lake platform
+1 for the vision; personally I find the incremental ETL part promising - with an engine like Apache Flink we can do intermediate aggregation in streaming style. Best, Danny Chan
Re: [DISCUSS] Hudi is the data lake platform
+1. Cool and promising.
Re: [DISCUSS] Hudi is the data lake platform
Agree with the rebranding Vinoth. Hudi is not just a "table format" and we need to do justice to all the cool auxiliary features/services we have built. Also, timeline metadata service in particular would be a really big win if we move towards something like that. On 4/13/21, 11:01 AM, "Pratyaksh Sharma" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Definitely we are doing much more than only ingesting and managing data over DFS. +1 from my side as well. :) On Tue, Apr 13, 2021 at 10:02 PM Susu Dong wrote: > I love this rebranding. Totally agree. +1 > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu > wrote: > > > +1 The vision looks fantastic. > > > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li wrote: > > > > > Awesome summary of Hudi! +1 as well. > > > > > > Gary Li > > > On 2021/04/13 14:13:24, Rubens Rodrigues > > > wrote: > > > > Excellent, I agree > > > > > > > > Em ter, 13 de abr de 2021 07:23, vino yang > > > escreveu: > > > > > > > > > +1 Excited by this new vision! > > > > > > > > > > Best, > > > > > Vino > > > > > > > > > > Dianjin Wang 于2021年4月13日周二 > > 下午3:53写道: > > > > > > > > > > > +1 The new brand is straightforward, a better description of > Hudi. > > > > > > > > > > > > Best, > > > > > > Dianjin Wang > > > > > > > > > > > > > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha < > > > bhavanisud...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > +1 . Cannot agree more. I think this makes total sense and will > > > provide > > > > > > for > > > > > > > a much better representation of the project. 
Re: [DISCUSS] Hudi is the data lake platform
Definitely we are doing much more than only ingesting and managing data over DFS. +1 from my side as well. :)
Re: [DISCUSS] Hudi is the data lake platform
I love this rebranding. Totally agree. +1
Re: [DISCUSS] Hudi is the data lake platform
+1 The vision looks fantastic.

On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar wrote:
> Hello all,
>
> Reading one more article today positioning Hudi as just a table format
> made me wonder if we have done enough justice in explaining what we have
> built together here.
> I tend to think of Hudi as the data lake platform, which has the
> following components, of which one is a table format and one is a
> transactional storage layer.
> But the whole stack we have is definitely worth more than the sum of all
> the parts IMO (speaking from my own experience from the past 10+ years of
> open source software dev).
>
> Here's what we have built so far.
>
> a) *table format* : something that stores table schema, plus a metadata
> table that stores file listings today and is being extended to store
> column ranges and more in the future (RFC-27)
> b) *aux metadata* : bloom filters and external record-level indexes
> today; bitmaps/interval trees and other advanced on-disk data structures
> tomorrow
> c) *concurrency control* : we have always supported MVCC-based, log-based
> concurrency (serializing writes into a time-ordered log), and with 0.8.0
> we now also have OCC for batch merge workloads. We will have multi-table
> and fully non-blocking writers soon (see the future work section of
> RFC-22)
> d) *updates/deletes* : this is the bread-and-butter use-case for Hudi,
> but we also support primary/unique key constraints, and we could add
> foreign keys as an extension once our transactions can span tables.
> e) *table services*: a hudi pipeline today is self-managing - it sizes
> files, cleans, compacts, clusters data and bootstraps existing data, with
> all these actions working off each other without blocking one another
> (for the most part).
> f) *data services*: we also have higher-level functionality with
> deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> coming, ...and more), incremental ETL support, de-duplication and commit
> callbacks; pre-commit validations are coming, and error tables have been
> proposed. I could also envision us building towards streaming egress and
> data monitoring.
>
> I also think we should build the following (subject to separate
> DISCUSS/RFCs)
>
> g) *caching service*: a Hudi-specific caching service that can hold
> mutable data and serve oft-queried data across engines.
> h) *timeline metaserver*: we already run a metaserver in Spark
> writers/drivers, backed by RocksDB and even Hudi's metadata table. Let's
> turn it into a scalable, sharded metastore that all engines can use to
> obtain any metadata.
>
> To this end, I propose we rebrand to "*Data Lake Platform*", as opposed
> to "ingests & manages storage of large analytical datasets over DFS (hdfs
> or cloud stores)", and convey the scope of our vision, given we have
> already been building towards that. It would also provide new
> contributors a good lens to look at the project from.
>
> (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> system to an event streaming platform, with the addition of
> MirrorMaker/Connect etc.)
>
> Please share your thoughts!
>
> Thanks
> Vinoth
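[Editor's illustrative aside: the record-level update/delete semantics in (d) and the time-ordered commit log in (c) quoted above can be sketched with a small toy model. This is NOT Hudi's actual implementation or API - `ToyTable` and all names in it are invented for the example - it only illustrates the idea of replaying a timeline of commits and resolving records by primary key.]

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class ToyTable:
    """Toy model of a table whose writes are serialized into a
    time-ordered commit log (MVCC-style timeline). Reads resolve
    records per primary key by replaying commits in order."""
    timeline: List[dict] = field(default_factory=list)  # ordered commits

    def upsert(self, commit_time: str, records: Dict[str, dict]) -> None:
        # Each write becomes the next entry on the timeline.
        self.timeline.append(
            {"time": commit_time, "upserts": records, "deletes": set()})

    def delete(self, commit_time: str, keys: Set[str]) -> None:
        # Deletes are also commits on the same timeline.
        self.timeline.append(
            {"time": commit_time, "upserts": {}, "deletes": keys})

    def snapshot(self, as_of: Optional[str] = None) -> Dict[str, dict]:
        # Replay the timeline in commit order; later commits win per key.
        # Passing as_of gives a point-in-time ("time travel") view.
        state: Dict[str, dict] = {}
        for commit in self.timeline:
            if as_of is not None and commit["time"] > as_of:
                break
            state.update(commit["upserts"])
            for k in commit["deletes"]:
                state.pop(k, None)
        return state

t = ToyTable()
t.upsert("001", {"uuid-1": {"fare": 10}, "uuid-2": {"fare": 20}})
t.upsert("002", {"uuid-1": {"fare": 15}})  # record-level update by key
t.delete("003", {"uuid-2"})                # record-level delete by key
print(t.snapshot())             # latest view: {'uuid-1': {'fare': 15}}
print(t.snapshot(as_of="001"))  # point-in-time view of the first commit
```

Real Hudi does far more than this (file sizing, indexing, compaction, OCC across writers), but the key-addressed, timeline-replay mental model above is the core that the thread is describing.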
Re: [DISCUSS] Hudi is the data lake platform
+1. The rewording makes total sense.

Balaji.V
Re: [DISCUSS] Hudi is the data lake platform
Awesome summary of Hudi! +1 as well.

Gary Li
Re: [DISCUSS] Hudi is the data lake platform
Excellent, I agree
Re: [DISCUSS] Hudi is the data lake platform
+1 Excited by this new vision!

Best,
Vino
> > > e) *table services*: a hudi pipeline today is self-managing - sizes > > files, > > > cleans, compacts, clusters data, bootstraps existing data - all these > > > actions working off each other without blocking one another. (for most > > > parts). > > > f) *data services*: we also have higher level functionality with > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is > > > coming, ...and more), incremental ETL support, de-duplication, commit > > > callbacks, pre-commit validations are coming, error tables have been > > > proposed. I could also envision us building towards streaming egress, > > data > > > monitoring. > > > > > > I also think we should build the following (subject to separate > > > DISCUSS/RFCs) > > > > > > g) *caching service*: Hudi specific caching service that can hold > mutable > > > data and serve oft-queried data across engines. > > > h) t*imeline metaserver:* We already run a metaserver in spark > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's > > turn > > > it into a scalable, sharded metastore, that all engines can use to > obtain > > > any metadata. > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed > to > > > "ingests & manages storage of large analytical datasets over DFS (hdfs > or > > > cloud stores)." and convey the scope of our vision, > > > given we have already been building towards that. It would also provide > > new > > > contributors a good lens to look at the project from. > > > > > > (This is very similar to for e.g, the evolution of Kafka from a pub-sub > > > system, to an event streaming platform - with addition of > > > MirrorMaker/Connect etc. ) > > > > > > Please share your thoughts! > > > > > > Thanks > > > Vinoth > > > > > >
Re: [DISCUSS] Hudi is the data lake platform
+1 The new brand is straightforward, a better description of Hudi.

Best,
Dianjin Wang
Re: [DISCUSS] Hudi is the data lake platform
+1. Cannot agree more. I think this makes total sense and will provide for a much better representation of the project.
[DISCUSS] Hudi is the data lake platform
Hello all,

Reading one more article today positioning Hudi as just a table format made me wonder if we have done enough justice in explaining what we have built together here. I tend to think of Hudi as the data lake platform, which has the following components, of which one is a table format and one is a transactional storage layer. But the whole stack we have is definitely worth more than the sum of all the parts IMO (speaking from my own experience from the past 10+ years of open source software dev).

Here's what we have built so far.

a) *table format*: something that stores the table schema, plus a metadata table that stores file listings today and is being extended to store column ranges and more in the future (RFC-27)
b) *aux metadata*: bloom filters and external record-level indexes today; bitmaps/interval trees and other advanced on-disk data structures tomorrow
c) *concurrency control*: we have always supported MVCC-based, log-based concurrency (serialize writes into a time-ordered log), and as of 0.8.0 we also have OCC for batch merge workloads. We will have multi-table and fully non-blocking writers soon (see the future work section of RFC-22)
d) *updates/deletes*: this is the bread-and-butter use case for Hudi; we support primary/unique key constraints, and we could add foreign keys as an extension once our transactions can span tables.
e) *table services*: a Hudi pipeline today is self-managing - it sizes files, cleans, compacts, clusters data, and bootstraps existing data, with all these actions working off each other without blocking one another (for the most part).
f) *data services*: we also have higher-level functionality, with deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is coming, ...and more), incremental ETL support, de-duplication, and commit callbacks; pre-commit validations are coming, and error tables have been proposed. I could also envision us building towards streaming egress and data monitoring.
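To make the aux metadata in (b) concrete, here is a toy bloom-filter sketch (illustrative only, not Hudi's actual implementation; the file names and the `files_to_check` helper are hypothetical): a writer publishes a per-file bloom filter of record keys, and a reader skips any file whose filter definitely does not contain the key it is looking up.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: may return false positives, never false negatives."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # bit set stored as a single int

    def _positions(self, key):
        # Derive k bit positions from independent hashes of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

def files_to_check(filters, key):
    """Prune files before opening them: keep only candidate matches."""
    return [f for f, bf in filters.items() if bf.might_contain(key)]
```

Because bloom filters never produce false negatives, a file pruned this way is guaranteed not to contain the key, which is what makes them safe for index-style file skipping during upserts and lookups.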
I also think we should build the following (subject to separate DISCUSS/RFCs)

g) *caching service*: a Hudi-specific caching service that can hold mutable data and serve oft-queried data across engines.
h) *timeline metaserver*: we already run a metaserver in Spark writers/drivers, backed by RocksDB and even Hudi's metadata table. Let's turn it into a scalable, sharded metastore that all engines can use to obtain any metadata.

To this end, I propose we rebrand to "*Data Lake Platform*", as opposed to "ingests & manages storage of large analytical datasets over DFS (hdfs or cloud stores)", and convey the scope of our vision, given we have already been building towards that. It would also give new contributors a good lens to look at the project from.

(This is very similar to, e.g., the evolution of Kafka from a pub-sub system to an event streaming platform, with the addition of MirrorMaker/Connect etc.)

Please share your thoughts!

Thanks
Vinoth
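The log-based MVCC idea in (c) - serializing writes into a time-ordered log and reading a consistent snapshot as of an instant - can be sketched in miniature (toy code, not Hudi's actual timeline; the class and method names are hypothetical):

```python
# Toy sketch of MVCC via a time-ordered commit log: writers append commits
# tagged with an instant time; readers build a consistent snapshot by
# replaying commits up to a chosen instant.
import bisect
import time

class TimelineLog:
    def __init__(self):
        # Sorted list of (instant_time, {key: value}) commits.
        self._commits = []

    def commit(self, payload, instant_time=None):
        """Append a commit; returns the instant time it was assigned."""
        t = instant_time if instant_time is not None else time.time()
        bisect.insort(self._commits, (t, payload))
        return t

    def snapshot(self, as_of):
        """MVCC read: replay commits up to `as_of` into a key/value state."""
        state = {}
        for t, payload in self._commits:
            if t > as_of:
                break
            state.update(payload)
        return state
```

Hudi's actual timeline is far richer (action types, states, archival), but the core property is the same: writes are serialized into a time-ordered log, and any reader can obtain a consistent view as of a given instant without blocking writers.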