Hey Vinoth - I think what we are trying to understand is whether Hudi has any built-in mechanism to prevent accidents from happening. In the event that, due to some error, two concurrent writers do end up trying to write to the same table, would Hudi allow both to run, or allow only one to succeed?
Brandon, I was looking at the code, and it seems that Hudi, by default, on the start of a new commit will roll back any pending commit, which in theory does not allow concurrent writes, but it seemed a bit arbitrary. Hence we wanted to understand whether this is actually intended to prevent concurrent writes on the same table, or whether that isn't the intention of the code below and users would have to do something externally, like serializing the writes or building some sort of locking protocol outside of Hudi. HoodieWriteClient is always initialized with "rollbackPending", rolling back previous pending commits.

Delta Sync: https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L479
Spark Writer: https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L195
HoodieWriteClient constructor: https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L120
HoodieWriteClient rollbackPending method: https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L1024

On 2020/04/14 21:28:05, Vinoth Chandar <[email protected]> wrote:
> Hi Brandon,
>
> This is more practical advice than sharing how to solve it using Hudi.
> By and large, this need can be mitigated by serializing your writes in an
> upstream message queue like Kafka. For example, let's say you want to delete
> some records in a table that is currently being ingested by
> deltastreamer. All you need to do is log more deletes, as described in this
> blog here, into the upstream Kafka topic. This will serialize the writes
> automatically for you. At least in my experience, I found this a much more
> efficient way of doing it, rather than allowing two writers to proceed
> and failing all except the latest writer.
> Downstream ETL tables built using Spark jobs also typically tend to be
> single writer. In short, I am saying: keep things single writer.
>
> That said, of late I have been thinking about this in the context of
> multi-table transactions and seeing if we should actually add the support.
> Would love to have some design partners if there is interest :)
>
> Thanks,
> Vinoth
>
> On Tue, Apr 14, 2020 at 9:23 AM Scheller, Brandon <[email protected]> wrote:
>
> > Hi all,
> >
> > If I understand correctly, Hudi is not currently recommended for
> > concurrent writer use cases. I was wondering what the community's official
> > stance on concurrency is, and what the recommended workarounds/solutions
> > are for Hudi to help prevent data corruption/duplication (for example,
> > we've heard of environments using an external table lock).
> >
> > Thanks,
> > Brandon
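[Editor's illustration] The rollback-on-start behavior discussed in this thread can be sketched with a toy model. This is a simplified, hypothetical simulation in Java, not Hudi's actual Timeline or HoodieWriteClient API; the class and method names (`Timeline`, `startCommit`, `commit`) are invented for illustration. It shows why "roll back pending commits on start" effectively lets only the latest writer succeed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Toy model of a commit timeline (illustrative only, not Hudi's API).
class Timeline {
    private final List<String> pending = new ArrayList<>();
    private final List<String> completed = new ArrayList<>();

    // Mirrors the effect of the rollbackPending default: starting a new
    // commit first rolls back any inflight (pending) commits.
    String startCommit() {
        pending.clear(); // roll back all pending commits
        String instant = UUID.randomUUID().toString();
        pending.add(instant);
        return instant;
    }

    // A commit only completes if its instant is still pending,
    // i.e. no later writer has rolled it back in the meantime.
    boolean commit(String instant) {
        if (!pending.remove(instant)) {
            return false; // instant was rolled back by a later writer
        }
        completed.add(instant);
        return true;
    }

    List<String> completed() { return completed; }
}

public class ConcurrentWriteDemo {
    public static void main(String[] args) {
        Timeline tl = new Timeline();
        String writerA = tl.startCommit(); // writer A begins
        String writerB = tl.startCommit(); // writer B begins, rolling back A
        System.out.println(tl.commit(writerB)); // true: B wins
        System.out.println(tl.commit(writerA)); // false: A was rolled back
    }
}
```

Under this model, writer A's commit is lost rather than corrupting the table, which matches the "safety mechanism, not concurrency support" reading of the code above; it is not a substitute for an external lock or upstream serialization.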
