Re: Data change events table in Hudi

Roshan Nair (Data Platform) Tue, 14 May 2019 02:19:51 -0700

Hi Minh/Vinoth,

I'm curious about what use cases having two tables addresses. I'm assuming
here the two tables you mention are the read-optimized (COW) table, and an
uncompacted write-optimized (MOR) table.


Hudi already provides two views (read-optimized and write-optimized) on the
same table, so what use cases require splitting this into two different
Hudi tables?

Roshan

On Tue, May 7, 2019 at 8:58 PM Vinoth Chandar <[email protected]> wrote:

> Thanks for starting the thread, Minh!
>
> We do the same thing at Uber actually. It's handy to join these two at
> times, and it's a common pattern.
> Curious to know what others think.
>
> A DeltaStreamer option seems like a good idea. There are some
> implementation considerations, such as how we configure this second table,
> but we can figure those out on the PR/JIRA.
>
> >  Can we update both tables transactionally? This would be a nice
> property to have. The current 2-job pattern does not support this.
> It's achievable with some caveats. For example, you can write to both
> datasets, then commit the second one only after the first one succeeds. If
> the second commit fails, we restore/roll back the first one. Note that some
> queries may technically have already picked up the first commit's changes
> (the race window will be small). General support for this needs more work,
> overlaying timelines, etc. You are welcome to take this on if you are
> interested. :)
>
> > Can we share the Avro logs? This might save some time as well
> as achieving the transactionality mentioned above but it increases
> complexity.
> Yes, it would change the core models and design a lot. In some cases, the
> logs may not even be the same across these tables. For example, if you take
> the HBase data model, you might get new cells out of your change stream,
> which is the raw change log. The snapshot/row table can have either cells
> in the Avro log or full row images, depending on where you want to pay the
> cost of merge. Let me know what you think.
>
>
>
> On Mon, May 6, 2019 at 10:19 PM Minh Pham <[email protected]> wrote:
>
> > Hi,
> >
> > A common pattern that I see is having 1 Kafka topic for data change
> > events and 2 Hudi ingestion jobs (1 in insert mode and 1 in upsert mode).
> > This creates 2 tables, 1 with all raw data change events and 1 with the
> > latest snapshot of data.
> >
> > What do you guys think about adding support for this as an option in
> > DeltaStreamer?
> >
> > There are some complications to consider:
> > - Can we update both tables transactionally? This would be a nice
> > property to have. The current 2-job pattern does not support this.
> > - Can we share the Avro logs? This might save some time as well as
> > achieving the transactionality mentioned above but it increases
> > complexity.
> >
> > Best,
> > Minh
> >
>
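
The commit ordering Vinoth describes in the thread (write to both datasets,
commit the second only after the first succeeds, roll the first back if the
second commit fails) can be sketched roughly as below. This is an illustration
only: `Dataset`, `write_both`, and their methods are hypothetical in-memory
stand-ins, not Hudi APIs, and both datasets receive the same batch for
simplicity.

```python
class Dataset:
    """Hypothetical in-memory stand-in for a dataset (not the Hudi API)."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit  # lets us simulate a failed commit
        self.commits = []    # committed batches, i.e. what queries can see
        self.pending = None  # batch written but not yet committed

    def write(self, records):
        self.pending = list(records)

    def commit(self):
        if self.fail_on_commit:
            self.pending = None
            raise RuntimeError(f"commit failed on {self.name}")
        self.commits.append(self.pending)
        self.pending = None

    def rollback_last_commit(self):
        self.commits.pop()


def write_both(first, second, records):
    """Return True only if both datasets committed the batch."""
    first.write(records)
    second.write(records)
    first.commit()  # if this raises, neither dataset has a visible commit
    try:
        second.commit()
    except RuntimeError:
        # Caveat from the thread: a query may already have read the first
        # commit during this window, so this is not a true transaction.
        first.rollback_last_commit()
        return False
    return True
```

For example, `write_both(Dataset("events"), Dataset("snapshot"), batch)` leaves
both datasets with one new commit, while a failure on the second dataset leaves
neither with a visible commit once the rollback completes.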
