Hey Vinoth - I think what we are trying to understand is whether Hudi has any built-in mechanism to prevent accidents from happening. In the event that, due to some error, two concurrent writers do end up trying to write to the same table, would Hudi allow both to run, or allow only one to succeed?
Brandon, I was looking at the code, and it seems that Hudi, by default, on the start of a new commit will roll back any pending commit, which in theory does not allow concurrent writes, but it seemed a bit arbitrary. Hence we wanted to understand whether this is actually intended to prevent concurrent writes on the same table, or whether that isn't the intention of the code below and users would have to do something externally, like serializing the writes or building some sort of locking protocol outside of Hudi. HoodieWriteClient is always initialized with "rollbackPending", rolling back previous pending commits.

Delta Sync: https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L479
Spark Writer: https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L195
HoodieWriteClient constructor: https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L120
HoodieWriteClient rollbackPending method: https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L1024

On 2020/04/14 21:28:05, Vinoth Chandar <[email protected]> wrote:
> Hi Brandon,
>
> This is more practical advice than sharing how to solve it using Hudi.
> By and large, this need can be mitigated by serializing your writes in an
> upstream message queue like Kafka. For example, let's say you want to delete
> some records in a table that is currently being ingested by
> deltastreamer. All you need to do is log more deletes, as described in this
> blog here, into the upstream Kafka topic. This will serialize the writes
> automatically for you. At least in my experience, I found this a much more
> efficient way of doing it, rather than allowing two writers to proceed
> and failing all except the latest writer.
> Downstream ETL tables built using Spark jobs also typically tend to be
> single writer. In short, I am saying: keep things single writer.
>
> That said, of late I have been thinking about this in the context of
> multi-table transactions and seeing if we should actually add the support.
> Would love to have some design partners if there is interest :)
>
> Thanks,
> Vinoth
>
> On Tue, Apr 14, 2020 at 9:23 AM Scheller, Brandon <[email protected]> wrote:
>
> > Hi all,
> >
> > If I understand correctly, Hudi is not currently recommended for
> > concurrent writer use cases. I was wondering what the community's official
> > stance on concurrency is, and what the recommended workarounds/solutions
> > are for Hudi to help prevent data corruption/duplication (for example,
> > we've heard of environments using an external table lock).
> >
> > Thanks,
> > Brandon
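[Editor's illustration] The rollback-on-start behavior discussed in this thread can be sketched with a toy model. This is a simplified, hypothetical simulation in Java, not Hudi's actual Timeline or HoodieWriteClient API; the class and method names (`Timeline`, `startCommit`, `commit`) are invented for illustration. It shows why "roll back pending commits on start" effectively lets only the latest writer succeed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Toy model of a commit timeline (illustrative only, not Hudi's API).
class Timeline {
    private final List<String> pending = new ArrayList<>();
    private final List<String> completed = new ArrayList<>();

    // Mirrors the effect of the rollbackPending default: starting a new
    // commit first rolls back any inflight (pending) commits.
    String startCommit() {
        pending.clear(); // roll back all pending commits
        String instant = UUID.randomUUID().toString();
        pending.add(instant);
        return instant;
    }

    // A commit only completes if its instant is still pending,
    // i.e. no later writer has rolled it back in the meantime.
    boolean commit(String instant) {
        if (!pending.remove(instant)) {
            return false; // instant was rolled back by a later writer
        }
        completed.add(instant);
        return true;
    }

    List<String> completed() { return completed; }
}

public class ConcurrentWriteDemo {
    public static void main(String[] args) {
        Timeline tl = new Timeline();
        String writerA = tl.startCommit(); // writer A begins
        String writerB = tl.startCommit(); // writer B begins, rolling back A
        System.out.println(tl.commit(writerB)); // true: B wins
        System.out.println(tl.commit(writerA)); // false: A was rolled back
    }
}
```

Under this model, writer A's commit is lost rather than corrupting the table, which matches the "safety mechanism, not concurrency support" reading of the code above; it is not a substitute for an external lock or upstream serialization.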
