Hi,

The getDeletePaths() method in the cleaner flow makes the following
assumption for the KEEP_LATEST_COMMITS policy:

/**
 * Selects the versions for file for cleaning, such that it
 * <p>
 * - Leaves the latest version of the file untouched - For older versions, - It leaves all the commits untouched which
 * has occured in last <code>config.getCleanerCommitsRetained()</code> commits - It leaves ONE commit before this
 * window. We assume that the max(query execution time) == commit_batch_time * config.getCleanerCommitsRetained().
 * This is 12 hours by default. This is essential to leave the file used by the query thats running for the max time.
 * <p>
 * This provides the effect of having lookback into all changes that happened in the last X commits. (eg: if you
 * retain 24 commits, and commit batch time is 30 mins, then you have 12 hrs of lookback)
 * <p>
 * This policy is the default.
 */

I want to understand the term commit_batch_time in this assumption, and the
assumption as a whole. As I understand it, the term refers to the time taken
by one end-to-end iteration of DeltaSync (only 7-8 minutes in my case). If
that is correct, this time varies with the size of the incoming RDD, so the
lookback window, and with it the longest query that is safely covered, is
effectively variable. In that case, what is a safe value for
<code>config.getCleanerCommitsRetained()</code>?
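
To make my reading of the assumption concrete, here is a minimal sketch of
the arithmetic. It is my own illustration, not Hudi code: the class and
method names are hypothetical, and it assumes commit_batch_time means the
wall-clock gap between consecutive commits, which is the part I would like
confirmed.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hudi: it only restates the Javadoc formula
//   max(query execution time) == commit_batch_time * commits retained
class RetentionSketch {

    // Lookback window covered by retaining the given number of commits.
    // Assumption: commitInterval is the gap between consecutive commits.
    static Duration lookback(int commitsRetained, Duration commitInterval) {
        return commitInterval.multipliedBy(commitsRetained);
    }

    // Smallest number of retained commits whose window covers the longest query.
    static long minCommitsRetained(Duration maxQueryTime, Duration commitInterval) {
        if (commitInterval.isZero() || commitInterval.isNegative()) {
            throw new IllegalArgumentException("commitInterval must be positive");
        }
        long queryMillis = maxQueryTime.toMillis();
        long intervalMillis = commitInterval.toMillis();
        // Ceiling division, with a floor of one retained commit.
        return Math.max(1, (queryMillis + intervalMillis - 1) / intervalMillis);
    }

    public static void main(String[] args) {
        // Javadoc example: 24 commits at 30 minutes each.
        System.out.println(lookback(24, Duration.ofMinutes(30)));
        // My case, with an assumed longest query of 1 hour and 8-minute commits.
        System.out.println(minCommitsRetained(Duration.ofHours(1), Duration.ofMinutes(8)));
    }
}
```

If this reading is right, a shorter commit interval shrinks the window for a
fixed number of retained commits, so the value would have to be sized against
the shortest interval I expect rather than the average.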

Basically, I want to set
<code>config.getCleanerCommitsRetained()</code> correctly for my Hudi
instance, which is why I am trying to understand the assumption. Its default
value is 10, and I would like to know whether it can be reduced further
without causing any query to fail.

Please help me with this.

Regards
Pratyaksh
