Thanks Vinoth, very useful.
Roshan

On Wed, May 15, 2019 at 1:34 AM Vinoth Chandar <[email protected]> wrote:

> Hi Roshan,
>
> https://eng.uber.com/uber-big-data-platform/ talks about the two tables in
> more detail. In our scenario, the log and snapshot table records are not
> 1-1, and we want to retain uncompacted logs, for example to understand
> state changes. This decision does not have anything to do with compaction
> scheduling.
> Hope that helps.
>
> Thanks
> Vinoth
>
> On Tue, May 14, 2019 at 11:46 AM Roshan Nair (Data Platform)
> <[email protected]> wrote:
>
> > Hey Vinoth,
> >
> > Thanks. Since you mention you do the same at Uber, is there a use for
> > keeping the log forever?
> > Or is it just more practical to maintain two tables rather than
> > coordinate an off-peak slot to run compaction on the read-optimized view?
> >
> > Roshan
> >
> > On Wed, May 15, 2019 at 12:05 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi Roshan,
> > >
> > > Good point. Actually the incremental view + either
> > > read-optimized/realtime view can provide similar functionality.
> > > However, I think Minh wanted to keep a log forever. When using just a
> > > single Hudi dataset, once the compactor runs or the cleaning happens,
> > > the log is compacted away.
> > > Does that make sense?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, May 14, 2019 at 2:19 AM Roshan Nair (Data Platform)
> > > <[email protected]> wrote:
> > >
> > > > Hi Minh/Vinoth,
> > > >
> > > > I'm curious about what use cases having two tables addresses. I'm
> > > > assuming here the two tables you mention are the read-optimized (COW)
> > > > table, and an uncompacted write-optimized (MOR) table.
> > > >
> > > > Hudi already provides two views (read-optimized and write-optimized)
> > > > on the same table, so what use cases require splitting this into two
> > > > different Hudi tables?
> > > >
> > > > Roshan
> > > >
> > > > On Tue, May 7, 2019 at 8:58 PM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Thanks for starting the thread, Minh!
> > > > >
> > > > > We do the same thing at Uber actually. It's handy to join these two
> > > > > at times, and it's a common pattern.
> > > > > So I'm curious to know what others think.
> > > > >
> > > > > The DeltaStreamer option seems like a good idea. There are some
> > > > > implementation considerations, such as how we configure this second
> > > > > table, but we can figure that out on the PR/JIRA.
> > > > >
> > > > > > Can we update both tables transactionally? This would be a nice
> > > > > > property to have. The current 2-job pattern does not support this.
> > > > > It's achievable with some caveats. For example, you can write to both
> > > > > datasets, then commit the second one only after the first one
> > > > > succeeds. If the second commit fails, then we restore/roll back the
> > > > > first one. Note that, technically speaking, some queries may have
> > > > > already picked up the first commit's changes (the race window will be
> > > > > small). General support for this needs more work, overlaying
> > > > > timelines, etc. You are welcome to take this on if you are
> > > > > interested. :)
> > > > >
> > > > > > Can we share the Avro logs? This might save some time as well
> > > > > > as achieving the transactionality mentioned above but it increases
> > > > > > complexity.
> > > > > Yes. It would change the core models and design a lot. In some cases,
> > > > > the logs may not even be the same across these tables. For example,
> > > > > if you take the HBase data model, you might get new cells out of your
> > > > > change stream, which is the raw change log. You can have the
> > > > > snapshot/row table hold either cells in the Avro log or full row
> > > > > images, depending on where you want to pay the cost of merge.
> > > > > Let me know what you think.
> > > > >
> > > > > On Mon, May 6, 2019 at 10:19 PM Minh Pham <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > A common pattern that I see is having 1 Kafka topic for data change
> > > > > > events and 2 Hudi ingestion jobs (1 in insert mode and 1 in upsert
> > > > > > mode). This creates 2 tables, 1 with all raw data change events and
> > > > > > 1 with the latest snapshot of data.
> > > > > >
> > > > > > What do you guys think about adding support for this as an option
> > > > > > in DeltaStreamer?
> > > > > >
> > > > > > There are some complications to consider:
> > > > > > - Can we update both tables transactionally? This would be a nice
> > > > > >   property to have. The current 2-job pattern does not support this.
> > > > > > - Can we share the Avro logs? This might save some time as well as
> > > > > >   achieving the transactionality mentioned above but it increases
> > > > > >   complexity.
> > > > > >
> > > > > > Best,
> > > > > > Minh
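
For anyone skimming the thread later, the commit ordering Vinoth describes (write to both datasets, commit the second only after the first succeeds, roll back the first if the second commit fails) could be sketched roughly as below. This is only an in-memory illustration, not Hudi code: `Table`, `CommitFailed`, `write_both` and the `fail_on_commit` flag are hypothetical names made up for the example, and a real implementation would go through Hudi's own commit and rollback machinery.

```python
class CommitFailed(Exception):
    """Raised when a (simulated) table commit fails."""


class Table:
    """Hypothetical in-memory stand-in for a dataset with a commit timeline."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit  # assumption: used to simulate a failure
        self.commits = []      # committed batches, oldest first
        self._pending = None   # batch written but not yet committed

    def write(self, batch):
        # Stage the batch; it is not visible to readers until commit().
        self._pending = list(batch)

    def commit(self):
        if self.fail_on_commit:
            self._pending = None
            raise CommitFailed(self.name)
        self.commits.append(self._pending)
        self._pending = None

    def rollback_last_commit(self):
        # Undo the most recent commit on this table.
        self.commits.pop()


def write_both(first, second, batch):
    """Write the batch to both tables, committing `first` before `second`.

    Returns True if both commits succeed, and False if the second commit
    fails and the first one is rolled back. If the first commit itself
    fails, the exception propagates and nothing has been committed.
    """
    first.write(batch)
    second.write(batch)
    first.commit()  # readers may see this commit before the second one lands
    try:
        second.commit()
    except CommitFailed:
        first.rollback_last_commit()
        return False
    return True
```

The sketch does nothing about the race window mentioned in the thread: a query that runs between the first commit and the rollback can still observe the first table's changes.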
