HeartSaVioR commented on code in PR #47393:
URL: https://github.com/apache/spark/pull/47393#discussion_r1683718327
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########

@@ -2129,6 +2129,13 @@ object SQLConf {
       .intConf
       .createWithDefault(100)

+  val MIN_VERSIONS_TO_DELETE = buildConf("spark.sql.streaming.minVersionsToDelete")
+    .internal()
+    .doc("The minimum number of stale versions to delete when maintenance is invoked.")
+    .version("2.1.1")
+    .intConf
+    .createWithDefault(30)

Review Comment:
   minBatchesToRetain is a guaranteed lower bound on the number of retained batches, but it has only ever been a one-sided guarantee. This config leverages the other, unguaranteed side (we may retain more than the minimum), so that we can trade off the number of files against API cost.

   I think the default value suggested here (30) is in line with the default value of minBatchesToRetain (100). If users don't mind retaining 100 batches, they also wouldn't mind retaining 130-140 batches to reduce the overhead and cost. Anyone who really cares about minimizing the number of files should have already changed minBatchesToRetain, and they can change this config as well.

   (Btw, I agree that the default of 100 for minBatchesToRetain hadn't been making much sense, but with the state data source reader there is now at least one use case for it. I don't have a strong opinion on whether the default value has to be kept to support that use case, though.)
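
   To make the trade-off concrete, here is a minimal sketch (not from the PR) of how a user might tune both knobs together. It assumes the key `spark.sql.streaming.minVersionsToDelete` introduced in this diff alongside the existing `spark.sql.streaming.minBatchesToRetain`; whether the new key stays internal or changes name depends on the final patch.

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("state-store-maintenance-tuning")
     .master("local[*]")
     // Existing config: guaranteed lower bound on retained state versions (default 100).
     .config("spark.sql.streaming.minBatchesToRetain", 100)
     // Proposed config (assumed key from this diff, default 30): only delete once at least
     // this many stale versions have accumulated, batching cleanup so each maintenance
     // run issues fewer delete calls at the cost of temporarily keeping more files.
     .config("spark.sql.streaming.minVersionsToDelete", 30)
     .getOrCreate()
   ```

   With these defaults a query would retain at least 100 versions and let roughly 30 extra accumulate before a cleanup pass, which is the 130-140 range mentioned above.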