HeartSaVioR commented on code in PR #47393:
URL: https://github.com/apache/spark/pull/47393#discussion_r1683718327
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########

@@ -2129,6 +2129,13 @@ object SQLConf {
       .intConf
       .createWithDefault(100)

+  val MIN_VERSIONS_TO_DELETE = buildConf("spark.sql.streaming.minVersionsToDelete")
+    .internal()
+    .doc("The minimum number of stale versions to delete when maintenance is invoked.")
+    .version("2.1.1")
+    .intConf
+    .createWithDefault(30)

Review Comment:
   minBatchesToRetain is a guaranteed lower bound on the number of retained batches, but it has only ever been a one-sided guarantee. This config leverages the other, unguaranteed side (we may retain more than the minimum), so that we can trade off the number of files against API cost.

   I think the default value suggested here (30) is in line with the default value of minBatchesToRetain (100). If users don't mind retaining 100 batches, they also wouldn't mind retaining 130-140 batches to reduce the overhead and cost. Anyone who really cares about minimizing the number of files should have already changed minBatchesToRetain, and they can change this config as well.

   (Btw, I agree that the default of 100 for minBatchesToRetain hadn't been making much sense, but with the state data source reader there is now at least one use case for it. I don't have a strong opinion on whether the default value has to be kept to support that use case, though.)
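
   To make the trade-off concrete, here is a minimal sketch (not from the PR) of how a user might tune both knobs together. It assumes the key `spark.sql.streaming.minVersionsToDelete` introduced in this diff alongside the existing `spark.sql.streaming.minBatchesToRetain`; whether the new key stays internal or changes name depends on the final patch.

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("state-store-maintenance-tuning")
     .master("local[*]")
     // Existing config: guaranteed lower bound on retained state versions (default 100).
     .config("spark.sql.streaming.minBatchesToRetain", 100)
     // Proposed config (assumed key from this diff, default 30): only delete once at least
     // this many stale versions have accumulated, batching cleanup so each maintenance
     // run issues fewer delete calls at the cost of temporarily keeping more files.
     .config("spark.sql.streaming.minVersionsToDelete", 30)
     .getOrCreate()
   ```

   With these defaults a query would retain at least 100 versions and let roughly 30 extra accumulate before a cleanup pass, which is the 130-140 range mentioned above.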