Hi Brandon,

This is more practical advice than sharing how to solve it using Hudi. By and large, this need can be mitigated by serializing your writes in an upstream message queue like Kafka. For example, let's say you want to delete some records in a table that is currently being ingested by DeltaStreamer. All you need to do is log more deletes, as described in this blog here, into the upstream Kafka topic. This serializes the writes automatically for you. At least in my experience, I have found this a much more efficient way of doing it than allowing two writers to proceed and failing all except the latest writer. Downstream ETL tables built using Spark jobs also typically tend to be single writer. In short, I am saying: keep things single writer.
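To make that concrete, here is a rough sketch (mine, not from the blog) of what publishing such a delete record onto the upstream Kafka topic could look like, assuming JSON records and a table schema that carries the _hoodie_is_deleted boolean field used for soft deletes. The broker address, topic name, and field names are placeholders you would swap for your own:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PublishDelete {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // The delete is just another record on the same topic DeltaStreamer
          // already ingests: it carries the record key plus _hoodie_is_deleted=true,
          // so it lands in a later commit and never races the in-flight writes.
          String deleteRecord =
              "{\"record_key\": \"user_123\", \"partition_path\": \"2020/04/14\", "
              + "\"ts\": 1586876400000, \"_hoodie_is_deleted\": true}";
          producer.send(new ProducerRecord<>("hudi-input-topic", "user_123", deleteRecord));
          producer.flush();
        }
      }
    }

Because the delete travels through the same ordered topic that DeltaStreamer is consuming, there is only ever one writer on the table and no lock or conflict resolution is needed.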
That said, of late I have been thinking about this in the context of multi-table transactions and seeing if we should actually add the support. Would love to have some design partners if there is interest :)

Thanks
Vinoth

On Tue, Apr 14, 2020 at 9:23 AM Scheller, Brandon <[email protected]> wrote:

> Hi all,
>
> If I understand correctly, Hudi is not currently recommended for the
> concurrent writer use cases. I was wondering what the community's official
> stance on concurrency is, and what the recommended workarounds/solutions
> are for Hudi to help prevent data corruption/duplication (for example, we've
> heard of environments using an external table lock).
>
> Thanks,
> Brandon
>
