Thanks for starting the thread, Minh! We actually do the same thing at Uber. It's handy to join these two at times, and it's a common pattern, so I'm curious to know what others think.
The DeltaStreamer option seems like a good idea. There are some implementation considerations, such as how we configure this second table, but we can figure those out on the PR/JIRA.

> Can we update both tables transactionally? This would be a nice property to have. The current 2-job pattern does not support this.

It's achievable, with some caveats. For example, you can write to both datasets, then commit the second one only after the first one succeeds. If the second commit fails, we restore/roll back the first one. Note that, technically speaking, some queries may have already picked up the first commit's changes (the race window will be small). General support for this needs more work, overlaying timelines, etc. You are welcome to take this on if you are interested. :)

> Can we share the Avro logs? This might save some time as well as achieving the transactionality mentioned above but it increases complexity.

Yes. It would change the core models and design a lot. In some cases, the logs may not even be the same across these tables. For example, if you take the HBase data model, you might get new cells out of your change stream, which is the raw change log. The snapshot/row table can have either cells in the Avro log or full row images, depending on where you want to pay the cost of the merge.

Let me know what you think.

On Mon, May 6, 2019 at 10:19 PM Minh Pham <[email protected]> wrote:

> Hi,
>
> A common pattern that I see is having 1 Kafka topic for data change events
> and 2 Hudi ingestion job (1 in insert mode and 1 in upsert mode). This
> creates 2 tables, 1 with all raw data change events and 1 with the latest
> snapshot of data.
>
> What do you guys think about adding support for as an option in
> DeltaStreamer?
>
> There are some complications to consider:
> - Can we update both tables transactionally? This would be a nice property
> to have. The current 2-job pattern does not support this.
> - Can we share the Avro logs? This might save some time as well as
> achieving the transactionality mentioned above but it increases complexity.
>
> Best,
> Minh
>
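
P.S. For what it's worth, the "commit the second table only after the first succeeds, otherwise roll back the first" ordering from the reply could be sketched roughly like this. It is only an illustration in Python: `Table`, `dual_commit`, and the `fail_on_commit` flag are hypothetical in-memory stand-ins invented for this sketch, not Hudi or DeltaStreamer APIs.

```python
class Table:
    """Hypothetical in-memory stand-in for a dataset with a commit timeline.

    This is NOT the Hudi API; it only models "written but not yet visible"
    versus "committed" data so the ordering of steps can be shown.
    """

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit  # test hook to simulate a failed commit
        self.pending = {}     # instant -> records written but not yet committed
        self.committed = []   # list of (instant, records) visible to queries

    def write(self, instant, records):
        # Data is written, but queries cannot see it until commit().
        self.pending[instant] = list(records)

    def commit(self, instant):
        if self.fail_on_commit:
            raise RuntimeError(f"commit {instant} failed on {self.name}")
        self.committed.append((instant, self.pending.pop(instant)))

    def rollback(self, instant):
        # Idempotent: drops the instant whether it is pending or committed.
        self.pending.pop(instant, None)
        self.committed = [c for c in self.committed if c[0] != instant]


def dual_commit(first, second, instant, first_records, second_records):
    """Write to both tables, commit first then second, undo both on failure."""
    first.write(instant, first_records)
    second.write(instant, second_records)
    try:
        first.commit(instant)
        # Race window: queries may already see `first`'s commit here.
        second.commit(instant)
    except Exception:
        first.rollback(instant)
        second.rollback(instant)
        return False
    return True


raw_events = Table("raw_events")
snapshot = Table("snapshot")
ok = dual_commit(raw_events, snapshot, "001", [{"id": 1, "v": "a"}], [{"id": 1, "v": "a"}])
```

Between the two `commit` calls a reader can still observe the first table ahead of the second, which is the small race window mentioned above; closing it fully would need the more general timeline work.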
