Hi,

The getDeletePaths() method in the cleaner flow makes the following assumption for the KEEP_LATEST_COMMITS policy:

/**
 * Selects the versions for file for cleaning, such that it
 * <p>
 * - Leaves the latest version of the file untouched - For older versions, - It leaves all the commits untouched which
 * has occured in last <code>config.getCleanerCommitsRetained()</code> commits - It leaves ONE commit before this
 * window. We assume that the max(query execution time) == commit_batch_time * config.getCleanerCommitsRetained().
 * This is 12 hours by default. This is essential to leave the file used by the query thats running for the max time.
 * <p>
 * This provides the effect of having lookback into all changes that happened in the last X commits. (eg: if you
 * retain 24 commits, and commit batch time is 30 mins, then you have 12 hrs of lookback)
 * <p>
 * This policy is the default.
 */

I want to understand the term commit_batch_time in this assumption, and the assumption as a whole. As I understand it, commit_batch_time refers to the time taken by one end-to-end iteration of DeltaSync (barely 7-8 minutes in my case). If that is correct, this time will vary with the size of the incoming RDD, so the maximum query execution time that the cleaner allows for is effectively a variable. In that case, what is a safe value for <code>config.getCleanerCommitsRetained()</code>?

Basically, I want to set <code>config.getCleanerCommitsRetained()</code> properly for my Hudi instance, which is why I am trying to understand the assumption. Its default value is 10, and I would like to know whether it can be reduced further without causing any query to fail.

Please help me with this.

Regards,
Pratyaksh
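
P.S. To make my reading of the assumption concrete, here is a minimal sketch of the arithmetic as I understand it. It is not Hudi code: the class and method names are hypothetical, it treats commit_batch_time as the time between two successive commits (my interpretation, which may be wrong), and the 1-hour longest query is a made-up figure for illustration.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hudi: illustrates the Javadoc's assumption
// max(query execution time) == commit_batch_time * commits retained.
public class CleanerLookback {

    // Lookback window covered by the retained commits.
    public static Duration lookback(int commitsRetained, Duration commitBatchTime) {
        return commitBatchTime.multipliedBy(commitsRetained);
    }

    // Smallest number of retained commits whose window covers the longest query
    // (ceiling division on milliseconds).
    public static int minCommitsRetained(Duration longestQuery, Duration commitBatchTime) {
        long queryMs = longestQuery.toMillis();
        long batchMs = commitBatchTime.toMillis();
        return (int) ((queryMs + batchMs - 1) / batchMs);
    }

    public static void main(String[] args) {
        // Example from the Javadoc: 24 commits at 30 minutes each.
        System.out.println(lookback(24, Duration.ofMinutes(30))); // PT12H
        // My setup: 10 commits at roughly 8 minutes each.
        System.out.println(lookback(10, Duration.ofMinutes(8)));  // PT1H20M
        // Assumed 1-hour longest query with 8-minute commits.
        System.out.println(minCommitsRetained(Duration.ofHours(1), Duration.ofMinutes(8))); // 8
    }
}
```

If this reading is right, then with 7-8 minute commits the setting of 10 gives only about 80 minutes of lookback, and lowering it shrinks that window proportionally.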
