Awesome.

https://issues.apache.org/jira/browse/HUDI-796 tracks this.

On Wed, Apr 15, 2020 at 3:17 AM Vinoth Chandar <[email protected]> wrote:

> Okay makes sense.. I think we can raise a PR for the tool, integrated into
> the CLI..
> Then everyone can weigh in more as well ?
>
> Thanks for taking this up, Pratyaksh!
>
> On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Hi Vinoth,
> >
> > Thank you for your guidance.
> >
> > I went through the code for RepairsCommand in the hudi-cli module, which
> > internally calls DedupeSparkJob.scala. The logic therein marks a file as
> > bad based on the commit time of its records. However, in my case even the
> > commit time is the same for the duplicates; the only fields that vary are
> > `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this class
> > will not help me.
> >
> > IIUC the logic in DedupeSparkJob only works when the duplicates were
> > created by an INSERT operation. If an UPDATE comes in for a duplicate
> > record, then both files containing that record will carry the same commit
> > time from then on. Such cases cannot be handled by considering
> > `_hoodie_commit_time` alone, which is exactly what I am experiencing.
> >
> > I have written a script to solve my use case. It is a no-brainer: I
> > simply delete the duplicate keys and rewrite the file. I wanted to check
> > whether it would add any value to our code base and whether I should
> > raise a PR for it. If the community agrees, we can work together to
> > further improve it and make it generic enough.
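[The script itself is not attached to the thread; a minimal sketch of the key-based dedup idea it describes might look like the following. Plain Python dicts stand in for Spark rows and parquet files, and the tie-break on `_hoodie_commit_seqno` is an illustrative assumption, not the script's confirmed behavior.]

```python
# Sketch only: deduplicate rows that share a _hoodie_record_key when
# _hoodie_commit_time cannot disambiguate. In the real script this would
# read and rewrite parquet files via Spark; here plain dicts stand in
# for rows so the idea is self-contained.

def dedupe_by_key(rows):
    """Keep one row per record key, preferring the lowest commit seqno."""
    best = {}
    for row in rows:
        key = row["_hoodie_record_key"]
        if (key not in best
                or row["_hoodie_commit_seqno"] < best[key]["_hoodie_commit_seqno"]):
            best[key] = row
    return list(best.values())

rows = [
    # Same key and commit time, different seqno/file -- the duplicate
    # case described in the thread.
    {"_hoodie_record_key": "k1", "_hoodie_commit_time": "20200413",
     "_hoodie_commit_seqno": "20200413_0_1", "_hoodie_file_name": "f1.parquet"},
    {"_hoodie_record_key": "k1", "_hoodie_commit_time": "20200413",
     "_hoodie_commit_seqno": "20200413_1_7", "_hoodie_file_name": "f2.parquet"},
    {"_hoodie_record_key": "k2", "_hoodie_commit_time": "20200413",
     "_hoodie_commit_seqno": "20200413_0_2", "_hoodie_file_name": "f1.parquet"},
]

deduped = dedupe_by_key(rows)  # one surviving row per key
```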
> >
> > On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <[email protected]>
> wrote:
> >
> > > Hi Pratyaksh,
> > >
> > > Your understanding is correct. There is a duplicate-fix tool in the CLI
> > > (I wrote it a while ago for COW, and did use it in production a few
> > > times for situations like these). Check that out? IIRC it will keep
> > > both the commits and their files, but simply get rid of the duplicate
> > > records and replace the parquet files in place.
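[Per the earlier message in this thread, the CLI tool decides which copy of a duplicated record to drop based on commit time. A rough sketch of that selection rule, again in plain Python with hypothetical row fields, also shows why it cannot decide when the commit times tie, which is the case Pratyaksh hit:]

```python
# Sketch only: the commit-time rule the thread attributes to
# DedupeSparkJob. For each duplicated key, keep the copy with the latest
# _hoodie_commit_time; when the commit times tie, the rule cannot decide.

def pick_copy_by_commit_time(copies):
    """Return the copy with the latest commit time, or None on a tie."""
    latest = max(copies, key=lambda r: r["_hoodie_commit_time"])
    ties = [r for r in copies
            if r["_hoodie_commit_time"] == latest["_hoodie_commit_time"]]
    return latest if len(ties) == 1 else None

# Duplicates from two distinct INSERT commits: decidable.
insert_dups = [
    {"_hoodie_commit_time": "20200410", "_hoodie_file_name": "f1.parquet"},
    {"_hoodie_commit_time": "20200412", "_hoodie_file_name": "f2.parquet"},
]
# After an UPDATE touched both copies, both carry the same commit time:
# the rule returns None, i.e. it cannot pick a survivor.
update_dups = [
    {"_hoodie_commit_time": "20200413", "_hoodie_file_name": "f1.parquet"},
    {"_hoodie_commit_time": "20200413", "_hoodie_file_name": "f2.parquet"},
]
```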
> > >
> > > >> Also once duplicates are written, you
> > > are not sure of which file the update will go to next since the record
> is
> > > already present in 2 different parquet files.
> > >
> > > IIRC bloom index will tag both files and both will be updated.
> > >
> > > The table could show many side effects, depending on when exactly the
> > > race happened:
> > >
> > > - The second commit may have rolled back the first inflight commit,
> > > mistaking it for a failed write. In that case some data may also be
> > > missing, though I would expect the first commit to actually fail, since
> > > its files got deleted midway through writing.
> > > - If both of them indeed succeeded, then it's just the duplicates.
> > >
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <
> [email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > From my experience so far of working with Hudi, I understand that
> > > > Hudi is not designed to handle concurrent writes from 2 different
> > > > sources, for example 2 instances of HoodieDeltaStreamer simultaneously
> > > > running and writing to the same dataset. I have seen that such a case
> > > > can result in duplicate records in case of inserts. Also, once
> > > > duplicates are written, you are not sure which file the next update
> > > > will go to, since the record is already present in 2 different parquet
> > > > files. Please correct me if I am wrong.
> > > >
> > > > Having experienced this in a few Hudi datasets, I now want to delete
> > > > one of the parquet files which contains duplicates in some partition
> > > > of a COW-type Hudi dataset. Can deleting a parquet file manually have
> > > > any repercussions? If yes, what are the possible side effects of
> > > > doing so?
> > > >
> > > > Any leads will be highly appreciated.
> > > >
> > >
> >
>
