Hi Brandon,

This is more practical advice than sharing how to solve it using Hudi. By and large, this need can be mitigated by serializing your writes in an upstream message queue like Kafka. For example, let's say you want to delete some records in a table that is currently being ingested by DeltaStreamer. All you need to do is log more deletes, as described in this blog here, into the upstream Kafka topic. This serializes the writes automatically for you. At least in my experience, I have found this a much more efficient way of doing it than allowing two writers to proceed and failing all except the latest writer. Downstream ETL tables built using Spark jobs also typically tend to be single writer. In short, I am saying: keep things single writer.
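To make that concrete, here is a rough sketch (mine, not from the blog) of what publishing such a delete record onto the upstream Kafka topic could look like, assuming JSON records and a table schema that carries the _hoodie_is_deleted boolean field used for soft deletes. The broker address, topic name, and field names are placeholders you would swap for your own:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PublishDelete {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // The delete is just another record on the same topic DeltaStreamer
          // already ingests: it carries the record key plus _hoodie_is_deleted=true,
          // so it lands in a later commit and never races the in-flight writes.
          String deleteRecord =
              "{\"record_key\": \"user_123\", \"partition_path\": \"2020/04/14\", "
              + "\"ts\": 1586876400000, \"_hoodie_is_deleted\": true}";
          producer.send(new ProducerRecord<>("hudi-input-topic", "user_123", deleteRecord));
          producer.flush();
        }
      }
    }

Because the delete travels through the same ordered topic that DeltaStreamer is consuming, there is only ever one writer on the table and no lock or conflict resolution is needed.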
That said, of late I have been thinking about this in the context of multi-table transactions and seeing if we should actually add the support. Would love to have some design partners if there is interest :)

Thanks
Vinoth

On Tue, Apr 14, 2020 at 9:23 AM Scheller, Brandon <[email protected]> wrote:

> Hi all,
>
> If I understand correctly, Hudi is not currently recommended for the
> concurrent writer use cases. I was wondering what the community's official
> stance on concurrency is, and what the recommended workarounds/solutions
> are for Hudi to help prevent data corruption/duplication (for example, we've
> heard of environments using an external table lock).
>
> Thanks,
> Brandon
>
