Hi Pratyaksh,

Your understanding is correct. There is a duplicate-fix tool in the CLI (I wrote it a while ago for COW tables, and have used it in production a few times for situations like these). Check that out? IIRC it keeps both commits and their files, but simply gets rid of the duplicate records and replaces the parquet files in place.
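To make the idea concrete, here is a minimal Python sketch of what "get rid of the duplicate records" means conceptually. This is purely illustrative, not the actual CLI tool: the real repair works on Hudi's record key inside parquet files, while this just dedupes in-memory dicts keyed on a hypothetical `_hoodie_record_key` field.

```python
# Illustrative sketch only -- NOT the actual Hudi CLI repair tool.
# Given records that were double-written across two parquet files in a
# concurrent-insert race, keep one copy per record key and drop the rest.

def deduplicate(records, key_field="_hoodie_record_key"):
    """Keep the first occurrence of each record key, drop later duplicates."""
    seen = set()
    deduped = []
    for rec in records:
        key = rec[key_field]
        if key not in seen:
            seen.add(key)
            deduped.append(rec)
    return deduped

# Records as they might look after two writers raced on the same inserts:
file1 = [{"_hoodie_record_key": "id1", "val": 1},
         {"_hoodie_record_key": "id2", "val": 2}]
file2 = [{"_hoodie_record_key": "id2", "val": 2},  # duplicate of id2
         {"_hoodie_record_key": "id3", "val": 3}]

clean = deduplicate(file1 + file2)
print([r["_hoodie_record_key"] for r in clean])  # -> ['id1', 'id2', 'id3']
```

The real tool then rewrites the affected parquet files in place with the deduplicated records, so both commits and their files remain in the timeline.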
>> Also once duplicates are written, you are not sure of which file the update will go to next since the record is already present in 2 different parquet files.

IIRC the bloom index will tag both files, and both will be updated.

The table could show various side effects depending on when exactly the race happened:

- The second commit may have rolled back the first inflight commit, mistaking it for a failed write. In that case, some data may also be missing. That said, I would expect the first commit to actually fail, since its files got deleted midway through writing.
- If both commits indeed succeeded, then it's just the duplicates.

Thanks
Vinoth

On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <[email protected]> wrote:

> Hi,
>
> From my experience so far of working with Hudi, I understand that Hudi is
> not designed to handle concurrent writes from 2 different sources, for
> example 2 instances of HoodieDeltaStreamer simultaneously running and
> writing to the same dataset. I have experienced that such a case can result
> in duplicate writes in case of inserts. Also once duplicates are written,
> you are not sure of which file the update will go to next since the record
> is already present in 2 different parquet files. Please correct me if I am
> wrong.
>
> Having experienced this in a few Hudi datasets, I now want to delete one of
> the parquet files which contains duplicates in some partition of a COW type
> Hudi dataset. I want to know if deleting a parquet file manually can have
> any repercussions? If yes, what all can be the side effects of doing the
> same?
>
> Any leads will be highly appreciated.
