+1. Couldn't agree more. I think this makes total sense and will give a much better representation of the project.
On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <vin...@apache.org> wrote:
> Hello all,
>
> Reading one more article today positioning Hudi as just a table format made
> me wonder whether we have done enough justice in explaining what we have
> built together here.
> I tend to think of Hudi as the data lake platform, which has the following
> components, of which one is a table format and one is a transactional
> storage layer.
> But the whole stack we have is definitely worth more than the sum of all
> the parts, IMO (speaking from my own experience from the past 10+ years of
> open source software dev).
>
> Here's what we have built so far.
>
> a) *table format*: something that stores the table schema, and a metadata
> table that stores file listings today, being extended to store column ranges
> and more in the future (RFC-27)
> b) *aux metadata*: bloom filters and external record-level indexes today;
> bitmaps/interval trees and other advanced on-disk data structures tomorrow
> c) *concurrency control*: we have always supported MVCC, log-based
> concurrency (serializing writes into a time-ordered log), and as of 0.8.0 we
> also have OCC for batch merge workloads. We will have multi-table and
> fully non-blocking writers soon (see the future work section of RFC-22)
> d) *updates/deletes*: this is the bread-and-butter use case for Hudi; we
> also support primary/unique key constraints, and we could add foreign keys
> as an extension once our transactions can span tables.
> e) *table services*: a Hudi pipeline today is self-managing - it sizes files,
> cleans, compacts, clusters data, and bootstraps existing data - with all these
> actions working off each other without blocking one another (for the most
> part).
> f) *data services*: we also have higher-level functionality with
> deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> coming, ...and more), incremental ETL support, de-duplication, and commit
> callbacks; pre-commit validations are coming, and error tables have been
> proposed. I could also envision us building towards streaming egress and data
> monitoring.
>
> I also think we should build the following (subject to separate
> DISCUSS threads/RFCs):
>
> g) *caching service*: a Hudi-specific caching service that can hold mutable
> data and serve oft-queried data across engines.
> h) *timeline metaserver*: we already run a metaserver in Spark
> writers/drivers, backed by RocksDB & even Hudi's metadata table. Let's turn
> it into a scalable, sharded metastore that all engines can use to obtain
> any metadata.
>
> To this end, I propose we rebrand to "*Data Lake Platform*", as opposed to
> "ingests & manages storage of large analytical datasets over DFS (hdfs or
> cloud stores)", and convey the scope of our vision, given we have already
> been building towards that. It would also give new contributors a good lens
> through which to look at the project.
>
> (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> system to an event streaming platform - with the addition of
> MirrorMaker/Connect etc.)
>
> Please share your thoughts!
>
> Thanks
> Vinoth
>
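
P.S. For contributors newer to the codebase, item (d) above is easy to see in a
minimal PySpark upsert sketch. This is only an illustration, not something from
Vinoth's mail: the table name, path, and field names (uuid, ts, city) are
hypothetical placeholders, and it assumes a Spark session started with the Hudi
Spark bundle on the classpath.

from pyspark.sql import SparkSession

# Assumes pyspark/spark-submit was launched with the Hudi Spark bundle available.
spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Toy records; "uuid" acts as the record (primary) key, "ts" as the pre-combine field.
df = spark.createDataFrame(
    [("id-1", "2021-04-12 10:30:00", "sf")],
    ["uuid", "ts", "city"],
)

(df.write.format("hudi")
    .option("hoodie.table.name", "demo_table")
    .option("hoodie.datasource.write.recordkey.field", "uuid")   # primary/unique key constraint
    .option("hoodie.datasource.write.precombine.field", "ts")    # latest record wins on key collisions
    .option("hoodie.datasource.write.operation", "upsert")       # updates/deletes rather than blind appends
    .mode("append")
    .save("/tmp/hudi/demo_table"))

# Writing a record with the same "uuid" and a newer "ts" updates it in place;
# setting the write operation to "delete" removes records by key.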