Hi Minh/Vinoth,

I'm curious which use cases having two tables addresses. I'm assuming the two tables you mention are the read-optimized (COW) table and an uncompacted write-optimized (MOR) table.
Hudi already provides two views (read-optimized and write-optimized) on the same table, so what use cases require splitting this into two different Hudi tables?

Roshan

On Tue, May 7, 2019 at 8:58 PM Vinoth Chandar <[email protected]> wrote:

> Thanks for starting the thread, Minh!
>
> We do the same thing at Uber actually. It's handy to join these two at times
> and it's a common pattern.
> So, curious to know what others think.
>
> DeltaStreamer option seems like a good idea. There are some implementation
> considerations on how we configure this second table etc.,
> but we can figure that out on the PR/JIRA.
>
> > Can we update both tables transactionally? This would be a nice
> > property to have. The current 2-job pattern does not support this.
> It's achievable with some caveats. For example, you can write to both datasets,
> then commit the second one only after the first one succeeds. If the second
> commit fails, then we restore/rollback the first one. Note that some queries may
> have already picked up the first commit's changes, technically speaking (the
> race time window will be small). General support for this needs more work,
> overlaying timelines, etc. You are welcome to take this on if you are
> interested. :)
>
> > Can we share the Avro logs? This might save some time as well
> > as achieving the transactionality mentioned above but it increases
> > complexity.
> Yes. It would change the core models and design a lot. In some cases, the
> logs may not even be the same across these tables. For example, if you take the
> HBase data model, you might get new cells out of your change stream, which
> is the raw change log. You can have the snapshot/row table hold either
> cells in the Avro log or full row images, depending on where you want to
> pay the cost of merge. Let me know what you think.
>
> On Mon, May 6, 2019 at 10:19 PM Minh Pham <[email protected]> wrote:
>
> > Hi,
> >
> > A common pattern that I see is having 1 Kafka topic for data change events
> > and 2 Hudi ingestion jobs (1 in insert mode and 1 in upsert mode). This
> > creates 2 tables, 1 with all raw data change events and 1 with the latest
> > snapshot of data.
> >
> > What do you guys think about adding support for this as an option in
> > DeltaStreamer?
> >
> > There are some complications to consider:
> > - Can we update both tables transactionally? This would be a nice property
> >   to have. The current 2-job pattern does not support this.
> > - Can we share the Avro logs? This might save some time as well as
> >   achieving the transactionality mentioned above but it increases complexity.
> >
> > Best,
> > Minh
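
For illustration only, the commit ordering described in the quoted reply (write to both tables, commit the second only after the first succeeds, and roll back the first if the second fails) might look roughly like the sketch below. Everything in it is an assumption made for the example: `Table`, `write_both`, and `fail_on_commit` are hypothetical in-memory stand-ins and not Hudi or DeltaStreamer APIs, and both tables receive the same records for simplicity, whereas the real pattern would insert raw events into one and upsert into the other.

```python
class Table:
    """Hypothetical in-memory stand-in for a table with a commit timeline."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit  # test hook to simulate a failed commit
        self.pending = {}    # commit_id -> records written but not yet visible
        self.committed = {}  # commit_id -> records visible to readers

    def write(self, commit_id, records):
        # Stage the data; readers cannot see it until commit() succeeds.
        self.pending[commit_id] = list(records)

    def commit(self, commit_id):
        if self.fail_on_commit:
            raise RuntimeError(f"commit {commit_id} failed on {self.name}")
        self.committed[commit_id] = self.pending.pop(commit_id)

    def rollback(self, commit_id):
        # Discard the commit whether it was only staged or already visible.
        self.pending.pop(commit_id, None)
        self.committed.pop(commit_id, None)


def write_both(first, second, commit_id, records):
    """Write to both tables, committing `second` only after `first` succeeds.

    Returns True if both commits succeeded, False if both were rolled back.
    """
    first.write(commit_id, records)
    second.write(commit_id, records)
    try:
        first.commit(commit_id)
        # Race window: a reader may already see `first`'s commit here,
        # before `second` has committed.
        second.commit(commit_id)
    except RuntimeError:
        first.rollback(commit_id)
        second.rollback(commit_id)
        return False
    return True
```

The sketch does not remove the small race window mentioned in the reply; closing it would need the more general support (overlaying timelines) that the reply says is still open work.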
