Thank you for the clarification Balaji. Now I understand it properly. :)

On Tue, Nov 19, 2019 at 11:17 PM Balaji Varadarajan <[email protected]> wrote:
> I updated the FAQ section to set defaults correctly and add more
> information related to this:
>
> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo
>
> The cleaner retention configuration is based on counts (number of commits
> to be retained) with the assumption that users need to provide a
> conservative number. The historical reason was that ingestion used to run
> at a specific cadence (e.g. every 30 mins), with the norm being an
> ingestion run taking less than 30 mins. With this model, it was simpler to
> represent the configuration as a count of commits to approximate the
> retention time.
>
> With delta-streamer continuous mode, ingestion is allowed to be scheduled
> immediately after the previous run is scheduled. I think it would make
> sense to introduce a time-based retention. I have created a newbie ticket
> for this: https://jira.apache.org/jira/browse/HUDI-349
>
> Pratyaksh, in sum, if the defaults are too low, use a conservative number
> based on the number of ingestion runs you see in your setup. The defaults
> as referenced in the code comments need to change (from 24 to 10)
> (https://jira.apache.org/jira/browse/HUDI-350).
>
> Thanks,
> Balaji.V
>
> On Tue, Nov 19, 2019 at 1:40 AM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Hi,
> >
> > We are assuming the following in the getDeletePaths() method in the
> > cleaner flow in case of the KEEP_LATEST_COMMITS policy -
> >
> > /**
> >  * Selects the versions for file for cleaning, such that it
> >  * <p>
> >  * - Leaves the latest version of the file untouched - For older
> >  * versions, - It leaves all the commits untouched which has occurred
> >  * in last <code>config.getCleanerCommitsRetained()</code> commits -
> >  * It leaves ONE commit before this window. We assume that the
> >  * max(query execution time) ==
> >  * commit_batch_time * config.getCleanerCommitsRetained().
> >  * This is 12 hours by default. This is essential to leave the file
> >  * used by the query that's running for the max time.
> >  * <p>
> >  * This provides the effect of having lookback into all changes that
> >  * happened in the last X commits. (eg: if you retain 24 commits, and
> >  * commit batch time is 30 mins, then you have 12 hrs of lookback)
> >  * <p>
> >  * This policy is the default.
> >  */
> >
> > I want to understand the term commit_batch_time in this assumption and
> > the assumption as a whole. As per my understanding, this term refers to
> > the time taken in one iteration of DeltaSync end to end (which is hardly
> > 7-8 minutes in my case). If my understanding is correct, then this time
> > will vary depending on the size of the incoming RDD. So in that case,
> > the time needed for the longest query is effectively a variable. So in
> > that case, what is a safe option to keep for the config
> > <code>config.getCleanerCommitsRetained()</code>?
> >
> > Basically I want to set the config
> > <code>config.getCleanerCommitsRetained()</code> properly for my Hudi
> > instance and hence I am trying to understand the assumption. Its default
> > value is 10; I want to understand if this can be reduced further without
> > any query failing.
> >
> > Please help me with this.
> >
> > Regards
> > Pratyaksh
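
For anyone reading this thread later, the arithmetic behind the assumption quoted above (lookback == commit_batch_time * commits retained) can be sketched roughly as follows. This is only an illustration and not Hudi code: the function names, the safety_factor parameter, and the 120-minute query duration are hypothetical and chosen for the example. The 24 commits / 30 minutes and the 10 commits / 8 minutes figures come from the messages above.

```python
import math


def lookback_minutes(commits_retained: int, commit_interval_minutes: float) -> float:
    """Approximate lookback window implied by a count-based retention setting.

    Mirrors the assumption in the quoted Javadoc:
    lookback == commit_batch_time * commits retained.
    """
    return commits_retained * commit_interval_minutes


def commits_to_retain(max_query_minutes: float,
                      min_commit_interval_minutes: float,
                      safety_factor: float = 1.0) -> int:
    """Hypothetical helper: smallest commit count whose lookback covers the longest query.

    Using the shortest interval seen between commits keeps the estimate
    conservative, since commits can land back to back in continuous mode.
    """
    if min_commit_interval_minutes <= 0:
        raise ValueError("commit interval must be positive")
    return math.ceil(max_query_minutes * safety_factor / min_commit_interval_minutes)


if __name__ == "__main__":
    # Figures from the Javadoc: 24 commits at a 30-minute cadence.
    print(lookback_minutes(24, 30) / 60, "hours")
    # Current default of 10 commits with runs of about 8 minutes.
    print(lookback_minutes(10, 8), "minutes")
    # Assumed longest query of 120 minutes, commits roughly every 8 minutes.
    print(commits_to_retain(120, 8), "commits")
```

With commits landing every 8 minutes or so, 10 retained commits give only about 80 minutes of lookback, so whether the value can be lowered depends on how long the longest-running query in your setup takes.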
