Okay, makes sense. I think we can raise a PR for the tool, integrated into the CLI. Then everyone can weigh in more as well?
Thanks for taking this up, Pratyaksh!

On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma <[email protected]> wrote:

> Hi Vinoth,
>
> Thank you for your guidance.
>
> I went through the code for RepairsCommand in the hudi-cli package, which
> internally calls DedupeSparkJob.scala. The logic there marks a file as bad
> based on the commit time of its records. In my case, however, even the
> commit time is the same for the duplicates; the only fields that differ
> are `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this class
> will not help me.
>
> IIUC the logic in DedupeSparkJob only works when the duplicates were
> created by an INSERT operation. If an UPDATE comes in for a duplicated
> record, then both files containing that record will carry the same commit
> time from then on. Such cases cannot be resolved by looking at
> `_hoodie_commit_time`, which is exactly what I am experiencing.
>
> I have written a script to solve my use case. It is a no-brainer: I simply
> delete the duplicate keys and rewrite the file. I wanted to check whether
> it would add any value to our code base and whether I should raise a PR
> for it. If the community agrees, we can work together to improve it
> further and make it generic enough.
>
> On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Pratyaksh,
> >
> > Your understanding is correct. There is a duplicate-fix tool in the CLI
> > (I wrote it a while ago for COW and have used it in production a few
> > times for situations like these). Check that out? IIRC it keeps both
> > commits and their files, but simply gets rid of the duplicate records
> > and replaces the parquet files in place.
> >
> > >> Also once duplicates are written, you are not sure of which file the
> > >> update will go to next since the record is already present in 2
> > >> different parquet files.
> >
> > IIRC the bloom index will tag both files, and both will be updated.
> >
> > The table could show many side effects depending on when exactly the
> > race happened:
> >
> > - The second commit may have rolled back the first inflight commit,
> > mistaking it for a failed write. In that case some data may also be
> > missing, though I would expect the first commit to actually fail, since
> > its files got deleted midway through the write.
> > - If both commits indeed succeeded, then it's just the duplicates.
> >
> > Thanks
> > Vinoth
> >
> > On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > From my experience so far of working with Hudi, I understand that Hudi
> > > is not designed to handle concurrent writes to the same dataset from
> > > two different sources, for example two instances of
> > > HoodieDeltaStreamer running simultaneously. I have seen such a case
> > > result in duplicate writes for inserts. Also, once duplicates are
> > > written, you cannot be sure which file a subsequent update will go to,
> > > since the record is already present in two different parquet files.
> > > Please correct me if I am wrong.
> > >
> > > Having experienced this in a few Hudi datasets, I now want to delete
> > > one of the parquet files containing duplicates in some partition of a
> > > COW-type Hudi dataset. I want to know whether deleting a parquet file
> > > manually can have any repercussions. If yes, what side effects can it
> > > have?
> > >
> > > Any leads will be highly appreciated.
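For anyone hitting the same situation, a quick way to confirm the duplicate pattern Pratyaksh describes (same `_hoodie_commit_time`, differing `_hoodie_commit_seqno` and `_hoodie_file_name`) is a small check over the affected partition. This is a sketch, not part of Hudi or its CLI; the partition path is a placeholder:

```scala
// Run in spark-shell, where `spark` is predefined. Point partitionPath at
// the affected partition of the COW table (placeholder value below).
import org.apache.spark.sql.functions._

val partitionPath = "hdfs:///data/my_table/2020/04/13" // placeholder

val df = spark.read.parquet(partitionPath)

// Duplicated keys: more than one copy, a single commit time, but the
// copies live in different parquet files.
df.groupBy("_hoodie_record_key")
  .agg(
    count("*").as("copies"),
    countDistinct("_hoodie_commit_time").as("commit_times"),
    countDistinct("_hoodie_file_name").as("files"))
  .filter(col("copies") > 1)
  .show(20, truncate = false)
```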
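And a minimal sketch of the kind of one-off dedupe-and-rewrite script Pratyaksh mentions, under stated assumptions: it is not Hudi's DedupeSparkJob, it keeps one copy per `_hoodie_record_key` (breaking ties arbitrarily on `_hoodie_commit_seqno`, since the commit times are identical), and it writes to a staging path rather than overwriting the live partition. A naive rewrite like this does not preserve Hudi's file-group naming, so treat it as an illustration of the idea rather than a drop-in fix:

```scala
// Run in spark-shell. Swapping the rewritten files into the live partition
// is a separate manual step; pause ingestion while doing it. Both paths
// are placeholders.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val partitionPath = "hdfs:///data/my_table/2020/04/13"       // placeholder
val stagingPath   = "hdfs:///tmp/my_table_dedupe/2020/04/13" // placeholder

// One row per record key: rank copies by commit seqno, keep the first.
val byKey = Window
  .partitionBy("_hoodie_record_key")
  .orderBy(col("_hoodie_commit_seqno"))

spark.read.parquet(partitionPath)
  .withColumn("rn", row_number().over(byKey))
  .filter(col("rn") === 1)
  .drop("rn")
  .write
  .parquet(stagingPath)
```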
