Re: Small clarification in Hoodie Cleaner flow

Pratyaksh Sharma Tue, 19 Nov 2019 22:37:11 -0800

Thank you for the clarification Balaji. Now I understand it properly. :)

On Tue, Nov 19, 2019 at 11:17 PM Balaji Varadarajan <[email protected]>
wrote:


> I updated the FAQ section to set defaults correctly and add more
> information related to this :
>
> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo
>
> The cleaner retention configuration is based on counts (number of commits
> to be retained) with the assumption that users need to provide a
> conservative number. The historical reason was that ingestion used to run
> at a specific cadence (e.g. every 30 mins), with the norm being an ingestion
> run taking less than 30 mins. With this model, it was simpler to represent
> the configuration as a count of commits to approximate the retention time.
>
> With delta-streamer continuous mode, ingestion is allowed to be scheduled
> immediately after the previous run completes. I think it would make
> sense to introduce a time-based retention. I have created a newbie ticket
> for this: https://jira.apache.org/jira/browse/HUDI-349
>
> Pratyaksh, in sum, if the defaults are too low, use a conservative number
> based on the number of ingestion runs you see in your setup. The defaults
> as referenced in the code comments need to change (from 24 to 10):
> https://jira.apache.org/jira/browse/HUDI-350
>
> Thanks,
> Balaji.V
>
> On Tue, Nov 19, 2019 at 1:40 AM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Hi,
> >
> > We are assuming the following in the getDeletePaths() method in the
> > cleaner flow in case of the KEEP_LATEST_COMMITS policy -
> >
> > /**
> >  * Selects the versions for file for cleaning, such that it
> >  * <p>
> >  * - Leaves the latest version of the file untouched
> >  * - For older versions,
> >  *   - It leaves all the commits untouched which have occurred in the last
> >  *     <code>config.getCleanerCommitsRetained()</code> commits
> >  *   - It leaves ONE commit before this window. We assume that the
> >  *     max(query execution time) ==
> >  *     commit_batch_time * config.getCleanerCommitsRetained().
> >  *     This is 12 hours by default. This is essential to leave the file
> >  *     used by the query that's running for the max time.
> >  * <p>
> >  * This provides the effect of having lookback into all changes that
> >  * happened in the last X commits. (eg: if you retain 24 commits, and
> >  * commit batch time is 30 mins, then you have 12 hrs of lookback)
> >  * <p>
> >  * This policy is the default.
> >  */
> >
> > I want to understand the term commit_batch_time in this assumption, and
> > the assumption as a whole. As per my understanding, this term refers to
> > the time taken by one end-to-end iteration of DeltaSync (which is hardly
> > 7-8 minutes in my case). If my understanding is correct, then this time
> > will vary depending on the size of the incoming RDD, so the time needed
> > for the longest query is effectively a variable. In that case, what is a
> > safe value to keep for the config
> > <code>config.getCleanerCommitsRetained()</code>?
> >
> > Basically, I want to set the config
> > <code>config.getCleanerCommitsRetained()</code> properly for my Hudi
> > instance, and hence I am trying to understand the assumption. Its default
> > value is 10; I want to understand whether this can be reduced further
> > without any query failing.
> >
> > Please help me with this.
> >
> > Regards
> > Pratyaksh
> >
>
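
As a rough sketch of the arithmetic discussed in this thread (lookback = commit interval * commits retained, e.g. 24 commits at 30 mins gives 12 hrs), the Java snippet below estimates a conservative commit count from the longest expected query time and the typical interval between commits. The class and method names are hypothetical and are not part of Hudi. It assumes evenly spaced commits, which continuous mode does not guarantee, so the result is best treated as a lower bound to be padded.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not a Hudi API.
final class CleanerRetentionSizing {

    private CleanerRetentionSizing() {}

    // Lookback window covered by a commit count, assuming evenly spaced commits.
    public static Duration lookback(int commitsRetained, Duration commitInterval) {
        return commitInterval.multipliedBy(commitsRetained);
    }

    // Smallest commit count whose lookback covers the longest query
    // (ceiling division, never less than 1).
    public static int commitsToRetain(Duration longestQuery, Duration commitInterval) {
        if (commitInterval.isZero() || commitInterval.isNegative()) {
            throw new IllegalArgumentException("commitInterval must be positive");
        }
        long queryMillis = longestQuery.toMillis();
        long intervalMillis = commitInterval.toMillis();
        long commits = (queryMillis + intervalMillis - 1) / intervalMillis;
        return (int) Math.max(1L, commits);
    }

    public static void main(String[] args) {
        // The example from the quoted Javadoc: 24 commits at 30 mins each.
        System.out.println(lookback(24, Duration.ofMinutes(30)));

        // Assumed numbers: 8-minute ingestion runs, queries of up to 1 hour.
        System.out.println(commitsToRetain(Duration.ofHours(1), Duration.ofMinutes(8)));
    }
}
```

With short, back-to-back runs (such as the 7-8 minute iterations mentioned above), a fixed count covers far less wall-clock time than it would at a 30-minute cadence, which is why a conservative count is suggested until a time-based retention exists.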
