Awesome. https://issues.apache.org/jira/browse/HUDI-796 tracks this.
On Wed, Apr 15, 2020 at 3:17 AM Vinoth Chandar <[email protected]> wrote:

> Okay, makes sense. I think we can raise a PR for the tool, integrated
> into the CLI. Then everyone can weigh in more as well?
>
> Thanks for taking this up, Pratyaksh!
>
> On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma <[email protected]> wrote:
>
> > Hi Vinoth,
> >
> > Thank you for your guidance.
> >
> > I went through the code for RepairsCommand in the hudi-cli package,
> > which internally calls DedupeSparkJob.scala. The logic therein marks a
> > file as bad based on the commit time of its records. However, in my
> > case even the commit time is the same for the duplicates; the only
> > fields that vary are `_hoodie_commit_seqno` and `_hoodie_file_name`.
> > So I guess this class will not help me.
> >
> > IIUC, the logic in DedupeSparkJob can only work when duplicates were
> > created by an INSERT operation. If an UPDATE comes in for a duplicated
> > record, then both files containing that record will have the same
> > commit time from then on. Such cases cannot be handled by comparing
> > `_hoodie_commit_time`, which is exactly what I am experiencing.
> >
> > I have written a script to solve my use case. It is a no-brainer: I
> > simply delete the duplicate keys and rewrite the file. I wanted to
> > check whether it would add any value to our code base and whether I
> > should raise a PR for it. If the community agrees, we can work
> > together to improve it further and make it generic enough.
> >
> > On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi Pratyaksh,
> > >
> > > Your understanding is correct. There is a duplicate-fix tool in the
> > > CLI (I wrote it a while ago for COW, and did use it in production a
> > > few times for situations like these). Check that out? IIRC it will
> > > keep both the commits and their files, but simply get rid of the
> > > duplicate records and replace the parquet files in place.
> > >
> > > >> Also once duplicates are written, you are not sure of which file
> > > >> the update will go to next since the record is already present in
> > > >> 2 different parquet files.
> > >
> > > IIRC the bloom index will tag both files, and both will be updated.
> > >
> > > The table could show many side effects depending on when exactly the
> > > race happened:
> > >
> > > - The second commit may have rolled back the first inflight commit,
> > > mistaking it for a failed write. In that case some data may also be
> > > missing, though I would expect the first commit to actually fail,
> > > since its files got deleted midway through writing.
> > > - If both of them indeed succeeded, then it's just the duplicates.
> > >
> > > Thanks,
> > > Vinoth
> > >
> > > On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > From my experience so far of working with Hudi, I understand that
> > > > Hudi is not designed to handle concurrent writes to the same
> > > > dataset from two different sources, for example two instances of
> > > > HoodieDeltaStreamer running and writing simultaneously. I have
> > > > seen such a case result in duplicate writes for inserts. Also,
> > > > once duplicates are written, you cannot be sure which file an
> > > > update will go to next, since the record is already present in two
> > > > different parquet files. Please correct me if I am wrong.
> > > >
> > > > Having experienced this in a few Hudi datasets, I now want to
> > > > delete one of the parquet files containing duplicates in some
> > > > partition of a COW-type Hudi dataset. I want to know whether
> > > > deleting a parquet file manually can have any repercussions, and
> > > > if so, what the side effects might be.
> > > >
> > > > Any leads will be highly appreciated.
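---

For readers hitting the same issue, a minimal sketch of how duplicates like the ones described in this thread can be surfaced: read a partition's parquet files with plain Spark (not a Hudi API) and group on `_hoodie_record_key`; any key with more than one copy, especially across more than one file, is a duplicate. The table path and app name are placeholders, not anything from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FindHudiDuplicates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("find-hudi-duplicates")
      .getOrCreate()

    // Read one affected partition of the COW table directly as parquet.
    // The path below is a placeholder.
    val df = spark.read.parquet("s3://bucket/hudi_table/2020/04/13")

    // A record key appearing more than once is a duplicate; showing the
    // distinct file names and commit times confirms the pattern from the
    // thread (same commit time, different seqno/file).
    df.groupBy("_hoodie_record_key")
      .agg(
        count("*").as("copies"),
        countDistinct("_hoodie_file_name").as("files"),
        collect_set("_hoodie_commit_time").as("commit_times"))
      .filter(col("copies") > 1)
      .show(truncate = false)
  }
}
```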

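And a sketch of the kind of rewrite Pratyaksh describes (not his actual script; the path, tie-breaking column, and staging-output choice are assumptions): keep exactly one copy per record key, breaking the tie on `_hoodie_commit_seqno` since `_hoodie_commit_time` is identical for these duplicates, and write the cleaned rows to a staging location rather than replacing files under a live table.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object DedupeHudiPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedupe-hudi-partition")
      .getOrCreate()

    val partitionPath = "s3://bucket/hudi_table/2020/04/13" // placeholder

    val df = spark.read.parquet(partitionPath)

    // _hoodie_commit_time is identical across the duplicate copies, so
    // order on _hoodie_commit_seqno, which does differ, and keep one row
    // per record key.
    val latestCopy = Window
      .partitionBy("_hoodie_record_key")
      .orderBy(col("_hoodie_commit_seqno").desc)

    val deduped = df
      .withColumn("rn", row_number().over(latestCopy))
      .filter(col("rn") === 1)
      .drop("rn")

    // Write to a staging path; swap the files in only after stopping
    // ingestion and backing up the partition, since rewriting files under
    // an active Hudi table in place is risky.
    deduped.write
      .mode(SaveMode.Overwrite)
      .parquet(partitionPath + "_deduped")
  }
}
```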