Okay, makes sense. I think we can raise a PR for the tool, integrated into
the CLI. Then everyone can weigh in more as well?

Thanks for taking this up, Pratyaksh!

On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma <[email protected]>
wrote:

> Hi Vinoth,
>
> Thank you for your guidance.
>
> I went through the code for RepairsCommand in the Hudi-cli package, which
> internally calls DedupeSparkJob.scala. The logic there basically marks a
> file as bad based on the commit time of its records. However, in my case
> even the commit time is the same for the duplicates; the only things that
> vary are `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this
> class will not help me.
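>
> For illustration, the duplicates I am seeing can be spotted with roughly
> this kind of query (just a sketch, assuming a spark-shell session; the
> table path is hypothetical):
>
>   import org.apache.spark.sql.functions._
>
>   // Read the affected partition's parquet files directly; the Hudi
>   // metadata columns are stored in them, which is all we need here.
>   val df = spark.read.parquet("/path/to/table/partition")  // hypothetical path
>
>   // Record keys that appear in more than one file while sharing a single
>   // commit time are exactly the duplicates described above.
>   val dupes = df
>     .groupBy("_hoodie_record_key")
>     .agg(
>       countDistinct("_hoodie_file_name").as("num_files"),
>       countDistinct("_hoodie_commit_time").as("num_commits"))
>     .filter(col("num_files") > 1 && col("num_commits") === 1)
>
>   dupes.show(false)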
>
> IIUC, the logic in DedupeSparkJob only works when the duplicates were
> created by an INSERT operation. If an UPDATE comes in for a duplicated
> record, then both files containing that record will carry the same commit
> time from then on. Such cases cannot be handled by looking at
> `_hoodie_commit_time`, which is exactly the situation I am experiencing.
>
> I have written a script to solve my use case. It is a no-brainer: I simply
> delete the duplicate keys and rewrite the file. I wanted to check whether
> it would add any value to our code base and whether I should raise a PR
> for it. If the community agrees, we can work together to improve it
> further and make it generic enough.
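>
> The idea is roughly the following (a simplified sketch, not the exact
> script; the paths are hypothetical and it assumes a spark-shell session):
>
>   import org.apache.spark.sql.SaveMode
>
>   val partitionPath = "/path/to/table/partition"  // hypothetical
>   val records = spark.read.parquet(partitionPath)
>
>   // Keep a single row per Hudi record key; which copy survives is
>   // arbitrary here.
>   val deduped = records.dropDuplicates("_hoodie_record_key")
>
>   // Write to a staging location, verify the counts, and only then swap
>   // the parquet files into the partition (after backing up the originals).
>   deduped.write.mode(SaveMode.Overwrite).parquet(partitionPath + "_deduped")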
>
> On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Pratyaksh,
> >
> > Your understanding is correct. There is a duplicate-fix tool in the CLI
> > (I wrote it a while ago for COW, and have used it in production a few
> > times for situations like these). Could you check that out? IIRC it
> > keeps both the commits and their files, but simply gets rid of the
> > duplicate records and replaces the parquet files in place.
> >
> > >> Also once duplicates are written, you
> > are not sure of which file the update will go to next since the record is
> > already present in 2 different parquet files.
> >
> > IIRC the bloom index will tag both files, and both will be updated.
> >
> > The table could show many side effects, depending on when exactly the
> > race happened:
> >
> > - The second commit may have rolled back the first in-flight commit,
> > mistaking it for a failed write. In that case, some data may also be
> > missing, though I would expect the first commit to actually fail since
> > its files got deleted midway through writing.
> > - If both of them indeed succeeded, then it's just the duplicates.
> >
> >
> > Thanks
> > Vinoth
> >
> >
> >
> >
> >
> > On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > From my experience so far of working with Hudi, I understand that Hudi
> > > is not designed to handle concurrent writes from 2 different sources,
> > > for example 2 instances of HoodieDeltaStreamer simultaneously running
> > > and writing to the same dataset. I have seen that such a case can
> > > result in duplicate records in the case of inserts. Also once
> > > duplicates are written, you are not sure of which file the update will
> > > go to next since the record is already present in 2 different parquet
> > > files. Please correct me if I am wrong.
> > >
> > > Having experienced this in a few Hudi datasets, I now want to delete
> > > one of the parquet files containing duplicates in some partition of a
> > > COW type Hudi dataset. Can deleting a parquet file manually have any
> > > repercussions? If yes, what side effects could doing so have?
> > >
> > > Any leads will be highly appreciated.
> > >
> >
>
