Re: Data change events table in Hudi

Roshan Nair (Data Platform) Tue, 14 May 2019 13:07:29 -0700

Thanks Vinoth,

Very useful.


Roshan

On Wed, May 15, 2019 at 1:34 AM Vinoth Chandar <[email protected]> wrote:

> Hi Roshan,
>
> https://eng.uber.com/uber-big-data-platform/ talks about the two tables in
> more detail. In our scenario, the log and snapshot table records are not
> 1-1, and we want to retain the uncompacted logs, for example to understand
> state changes. This decision does not have anything to do with compaction
> scheduling.
> Hope that helps.
>
> Thanks
> Vinoth
>
> On Tue, May 14, 2019 at 11:46 AM Roshan Nair (Data Platform)
> <[email protected]> wrote:
>
> > Hey Vinoth,
> >
> > Thanks. Since you mention you do the same at Uber, is there a use for
> > keeping the log forever?
> > Or is it just more practical to maintain two tables rather than coordinate
> > an off-peak slot to run compaction on the read-optimized view?
> >
> > Roshan
> >
> > On Wed, May 15, 2019 at 12:05 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi Roshan,
> > >
> > > Good point. Actually, the incremental view plus either the
> > > read-optimized or realtime view can provide similar functionality.
> > > However, I think Minh wanted to keep a log forever. When using just a
> > > single Hudi dataset, once the compactor runs or the cleaning happens,
> > > the log is compacted away.
> > > Does that make sense?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, May 14, 2019 at 2:19 AM Roshan Nair (Data Platform)
> > > <[email protected]> wrote:
> > >
> > > > Hi Minh/Vinoth,
> > > >
> > > > I'm curious about what use cases having two tables addresses. I'm
> > > > assuming here that the two tables you mention are the read-optimized
> > > > (COW) table and an uncompacted write-optimized (MOR) table.
> > > >
> > > > Hudi already provides two views (read-optimized and write-optimized) on
> > > > the same table, so what use cases require splitting this into two
> > > > different Hudi tables?
> > > >
> > > > Roshan
> > > >
> > > > On Tue, May 7, 2019 at 8:58 PM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Thanks for starting the thread, Minh!
> > > > >
> > > > > We do the same thing at Uber, actually. It's handy to join these two
> > > > > at times, and it's a common pattern, so I'm curious to know what
> > > > > others think.
> > > > >
> > > > > The DeltaStreamer option seems like a good idea. There are some
> > > > > implementation considerations, such as how we configure this second
> > > > > table, but we can figure that out on the PR/JIRA.
> > > > >
> > > > > > Can we update both tables transactionally? This would be a nice
> > > > > > property to have. The current 2-job pattern does not support this.
> > > > > It's achievable with some caveats. For example, you can write to both
> > > > > datasets, then commit the second one only after the first one
> > > > > succeeds. If the second commit fails, we restore/roll back the first
> > > > > one. Note that, technically speaking, some queries may have already
> > > > > picked up the first commit's changes (the race window will be small).
> > > > > General support for this needs more work, overlaying timelines,
> > > > > etc. You are welcome to take this on if you are interested. :)
> > > > >
> > > > > > Can we share the Avro logs? This might save some time as well as
> > > > > > achieving the transactionality mentioned above but it increases
> > > > > > complexity.
> > > > > Yes. It would change the core models and design a lot. In some cases,
> > > > > the logs may not even be the same across these tables. For example,
> > > > > if you take the HBase data model, you might get new cells out of your
> > > > > change stream, which is the raw change log. The snapshot/row table
> > > > > can hold either cells in the Avro log or full row images, depending
> > > > > on where you want to pay the cost of the merge. Let me know what you
> > > > > think.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, May 6, 2019 at 10:19 PM Minh Pham <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > A common pattern that I see is having 1 Kafka topic for data
> > > > > > change events and 2 Hudi ingestion jobs (1 in insert mode and 1 in
> > > > > > upsert mode). This creates 2 tables, 1 with all raw data change
> > > > > > events and 1 with the latest snapshot of data.
> > > > > >
> > > > > > What do you guys think about adding support for this as an option
> > > > > > in DeltaStreamer?
> > > > > >
> > > > > > There are some complications to consider:
> > > > > > - Can we update both tables transactionally? This would be a nice
> > > > > >   property to have. The current 2-job pattern does not support this.
> > > > > > - Can we share the Avro logs? This might save some time as well as
> > > > > >   achieving the transactionality mentioned above but it increases
> > > > > >   complexity.
> > > > > >
> > > > > > Best,
> > > > > > Minh
> > > > > >
> > > > >
> > > >
> > >
> >
>
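
For readers skimming the thread, here is a rough sketch of the pattern discussed above: one stream of change events feeding an insert-mode table (every raw event) and an upsert-mode table (latest value per key), with the second commit attempted only after the first succeeds and the first rolled back if the second fails. This is a plain in-memory simulation in Python, not Hudi code. InMemoryTable, write_both, fail_next_commit, and the (key, payload) event shape are all hypothetical names made up for illustration, and the sketch ignores the small window in which a reader could see the first commit before it is rolled back.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical event shape: (record key, payload). Not a Hudi type.
Event = Tuple[str, dict]


@dataclass
class InMemoryTable:
    """In-memory stand-in for a dataset. This is NOT the Hudi API."""
    upsert: bool                                   # True: keep latest per key
    committed: List[Event] = field(default_factory=list)
    history: List[List[Event]] = field(default_factory=list)
    fail_next_commit: bool = False                 # test hook to force a failure

    def commit(self, batch: List[Event]) -> None:
        if self.fail_next_commit:
            self.fail_next_commit = False
            raise RuntimeError("simulated commit failure")
        # Remember the pre-commit state so the commit can be rolled back.
        self.history.append(list(self.committed))
        if self.upsert:
            latest = dict(self.committed)
            latest.update(dict(batch))             # later events win per key
            self.committed = list(latest.items())
        else:
            self.committed.extend(batch)           # append every raw event

    def rollback_last(self) -> None:
        # Restore the state saved by the most recent successful commit.
        self.committed = self.history.pop()


def write_both(first: InMemoryTable, second: InMemoryTable,
               batch: List[Event]) -> bool:
    """Commit to `first`, then `second`; undo `first` if `second` fails."""
    first.commit(batch)
    try:
        second.commit(batch)
    except RuntimeError:
        first.rollback_last()
        return False
    return True


log_table = InMemoryTable(upsert=False)       # all raw change events
snapshot_table = InMemoryTable(upsert=True)   # latest snapshot per key

ok_first = write_both(log_table, snapshot_table,
                      [("k1", {"v": 1}), ("k1", {"v": 2})])

# Force the second table's commit to fail; the first table is rolled back.
snapshot_table.fail_next_commit = True
ok_second = write_both(log_table, snapshot_table, [("k2", {"v": 3})])
```

After the first call the log table holds both events for k1 while the snapshot table holds only the latest one, which is also why the records in the two tables are not 1-1. After the failed second call neither table contains k2.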
