Hi Roshan,

https://eng.uber.com/uber-big-data-platform/ talks about the two tables in
more detail. In our scenario, the log and snapshot table records are not
1-1, and we want to retain the uncompacted logs, e.g. to understand state
changes. This decision has nothing to do with compaction scheduling.
Hope that helps.

Thanks
Vinoth

On Tue, May 14, 2019 at 11:46 AM Roshan Nair (Data Platform)
<[email protected]> wrote:

> Hey Vinoth,
>
> Thanks. Since you mention you do the same at Uber, is there a use for
> keeping the log forever?
> Or is it just more practical to maintain two tables rather than coordinate
> an off-peak slot to run compaction on the read-optimized view?
>
> Roshan
>
> On Wed, May 15, 2019 at 12:05 AM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Roshan,
> >
> > Good point. Actually the incremental view + either read-optimized/realtime
> > view can provide similar functionality.
> > However, I think Minh wanted to keep a log forever. When using just a
> > single Hudi dataset, once the compactor runs or the cleaning happens, the
> > log is compacted away.
> > Does that make sense?
> >
> > Thanks
> > Vinoth
> >
> > On Tue, May 14, 2019 at 2:19 AM Roshan Nair (Data Platform)
> > <[email protected]> wrote:
> >
> > > Hi Minh/Vinoth,
> > >
> > > I'm curious about what use cases having two tables addresses. I'm
> > > assuming here the two tables you mention are the read-optimized (COW)
> > > table and an uncompacted write-optimized (MOR) table.
> > >
> > > Hudi already provides two views (read-optimized and write-optimized) on
> > > the same table, so what use cases require splitting this into two
> > > different Hudi tables?
> > >
> > > Roshan
> > >
> > > On Tue, May 7, 2019 at 8:58 PM Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Thanks for starting the thread, Minh!
> > > >
> > > > We do the same thing at Uber actually. It's handy to join these two at
> > > > times, and it's a common pattern.
> > > > So I'm curious to know what others think.
> > > >
> > > > A DeltaStreamer option seems like a good idea. There are some
> > > > implementation considerations, such as how we configure this second
> > > > table, but we can figure those out on the PR/JIRA.
> > > >
> > > > > Can we update both tables transactionally? This would be a nice
> > > > > property to have. The current 2-job pattern does not support this.
> > > > It's achievable with some caveats. For example, you can write to both
> > > > datasets, then commit the second one only after the first one succeeds.
> > > > If the second commit fails, we restore/roll back the first one. Note
> > > > that, technically speaking, some queries may have already picked up the
> > > > first commit's changes (the race window will be small). General support
> > > > for this needs more work, overlaying timelines, etc. You are welcome to
> > > > take this on if you are interested. :)
> > > >
> > > > > Can we share the Avro logs? This might save some time as well as
> > > > > achieving the transactionality mentioned above but it increases
> > > > > complexity.
> > > > Yes. It would change the core models and design a lot. In some cases,
> > > > the logs may not even be the same across these tables. For example, if
> > > > you take the HBase data model, you might get new cells out of your
> > > > change stream, which is the raw change log. You can have the
> > > > snapshot/row table hold either cells in the Avro log or full row
> > > > images, depending on where you want to pay the cost of the merge. Let
> > > > me know what you think.
> > > >
> > > >
> > > >
> > > > On Mon, May 6, 2019 at 10:19 PM Minh Pham <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > A common pattern that I see is having 1 Kafka topic for data change
> > > > > events and 2 Hudi ingestion jobs (1 in insert mode and 1 in upsert
> > > > > mode). This creates 2 tables, 1 with all raw data change events and 1
> > > > > with the latest snapshot of data.
> > > > >
> > > > > What do you guys think about adding support for this as an option in
> > > > > DeltaStreamer?
> > > > >
> > > > > There are some complications to consider:
> > > > > - Can we update both tables transactionally? This would be a nice
> > > > >   property to have. The current 2-job pattern does not support this.
> > > > > - Can we share the Avro logs? This might save some time as well as
> > > > >   achieving the transactionality mentioned above but it increases
> > > > >   complexity.
> > > > >
> > > > > Best,
> > > > > Minh
> > > > >
> > > >
> > >
> >
>
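
For reference, the commit ordering discussed in the thread (write to both
datasets, commit the second only after the first succeeds, and roll back the
first if the second commit fails) can be sketched roughly as below. This is
only an illustration in plain Python: InMemoryDataset, write_both, and all of
their methods are hypothetical stand-ins, not Hudi APIs, and the same batch is
written to both datasets for simplicity, even though the log and snapshot
tables may hold different records in practice.

```python
class InMemoryDataset:
    """Hypothetical stand-in for a dataset with a commit timeline (not a Hudi API)."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit  # lets a caller simulate a failed commit
        self.committed = []   # batches visible to readers
        self.pending = None   # batch written but not yet committed

    def write(self, batch):
        self.pending = list(batch)

    def commit(self):
        if self.fail_on_commit:
            raise RuntimeError(f"commit failed on {self.name}")
        self.committed.append(self.pending)
        self.pending = None

    def rollback_last_commit(self):
        self.committed.pop()

    def discard_pending(self):
        self.pending = None


def write_both(first, second, batch):
    """Write batch to both datasets; return True only if both commits landed."""
    first.write(batch)
    second.write(batch)
    first.commit()  # if this raises, nothing has been made visible yet
    try:
        second.commit()
    except RuntimeError:
        # Readers may already have seen the first commit during this window.
        first.rollback_last_commit()
        second.discard_pending()
        return False
    return True
```

Note that the rollback only restores the first dataset after the fact; as
mentioned in the thread, a query running in the small window between the two
commits could still observe the first commit.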
