Hi Pratyaksh,

Your understanding is correct. There is a duplicate-fix tool in the CLI (I
wrote it a while ago for COW tables, and have used it in production a few
times for situations like these). Check that out? IIRC it keeps both the
commits and their files, but simply gets rid of the duplicate records and
replaces the parquet files in place.
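For reference, this lives under the CLI's repair commands. A rough sketch of the invocation (exact command names and flags may differ across Hudi versions; the table path and partition below are placeholders):

```shell
# Start the Hudi CLI and connect to the affected table (placeholder path)
hudi-cli.sh
connect --path s3://my-bucket/path/to/hudi-table

# Deduplicate one partition: reads the duplicated files, drops duplicate
# records, and writes repaired parquet files to the output path for review
repair deduplicate \
  --duplicatedPartitionPath 2020/04/13 \
  --repairedOutputPath /tmp/repaired-partition
```

Worth dry-running on a copy of the partition first and diffing record counts before overwriting anything in place.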

>> Also once duplicates are written, you
are not sure of which file the update will go to next since the record is
already present in 2 different parquet files.

IIRC the bloom index will tag both files, and both will be updated.

The table could show many side effects depending on exactly when the race
happened:

- The second commit may have rolled back the first inflight commit,
mistaking it for a failed write. In this case some data may also be
missing, though I'd expect the first commit to actually fail since its
files got deleted midway through writing.
- If both of them indeed succeeded, then it's just the duplicates.


Thanks
Vinoth





On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <[email protected]>
wrote:

> Hi,
>
> From my experience so far of working with Hudi, I understand that Hudi is
> not designed to handle concurrent writes from 2 different sources for
> example 2 instances of HoodieDeltaStreamer are simultaneously running and
> writing to the same dataset. I have experienced such a case can result in
> duplicate writes in case of inserts. Also once duplicates are written, you
> are not sure of which file the update will go to next since the record is
> already present in 2 different parquet files. Please correct me if I am
> wrong.
>
> Having experienced this in few Hudi datasets, I now want to delete one of
> the parquet files which contains duplicates in some partition of a COW type
> Hudi dataset. I want to know if deleting a parquet file manually can have
> any repercussions? If yes, what all can be the side effects of doing the
> same?
>
> Any leads will be highly appreciated.
>
