Re: [PR] [SPARK-51261][ML][PYTHON][CONNECT] Introduce model size estimation to control ml cache [spark]

via GitHub Mon, 24 Feb 2025 18:15:28 -0800


hvanhovell commented on code in PR #50013:
URL: https://github.com/apache/spark/pull/50013#discussion_r1968711609



##########
mllib/src/main/scala/org/apache/spark/ml/util/Summary.scala:
##########
@@ -18,11 +18,21 @@
 package org.apache.spark.ml.util
 
 import org.apache.spark.annotation.Since
+import org.apache.spark.util.KnownSizeEstimation
 
 /**
+ * For ml connect only.
  * Trait for the Summary
  * All the summaries should extend from this Summary in order to
  * support connect.
  */
 @Since("4.0.0")
-private[spark] trait Summary
+private[spark] trait Summary extends KnownSizeEstimation {
+
+  // A summary is normally a small object, with several RDDs or DataFrame.
+  // The SizeEstimator is likely to overestimate the size of the summary,
+  // because it will also count the underlying SparkSession and/or SparkContext,

Review Comment:
   This is why I don't like the SizeEstimator.
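
The concern in the quoted comment can be illustrated with a minimal, Spark-free sketch. Every name below (`KnownSize`, `SharedContext`, `Handle`, `ToyEstimator`, and the two summary classes) is a hypothetical stand-in invented for this illustration, not a Spark class, and the toy estimator only mimics the general idea of a reachability-based estimate that defers to an object's self-reported size. It is not how `SizeEstimator` is actually implemented.

```scala
// Hypothetical stand-in for a trait that lets an object report its own size.
trait KnownSize { def estimatedSize: Long }

// Stands in for a heavy object shared by everything (e.g. a session or context).
final class SharedContext(val payload: Array[Byte])

// Stands in for a DataFrame/RDD handle that keeps a reference to the shared context.
final class Handle(val ctx: SharedContext)

trait Summary {
  def metrics: Array[Double]
  def data: Handle
}

// A summary that gives the estimator no hint about its size.
final class PlainSummary(val metrics: Array[Double], val data: Handle) extends Summary

// A summary that reports only the data it owns (8 bytes per Double metric).
final class SizedSummary(val metrics: Array[Double], val data: Handle)
    extends Summary with KnownSize {
  override def estimatedSize: Long = 8L * metrics.length
}

object ToyEstimator {
  // Toy estimate: trust a self-reported size when available,
  // otherwise add up the array payloads of everything reachable.
  def estimate(obj: Any): Long = obj match {
    case k: KnownSize     => k.estimatedSize
    case s: Summary       => 8L * s.metrics.length + estimate(s.data)
    case h: Handle        => estimate(h.ctx)
    case c: SharedContext => c.payload.length.toLong
    case _                => 0L
  }
}

object Demo {
  val ctx = new SharedContext(new Array[Byte](1 << 20)) // 1 MiB shared payload
  val metrics = Array(0.1, 0.2, 0.3)
  val plain = new PlainSummary(metrics, new Handle(ctx))
  val sized = new SizedSummary(metrics, new Handle(ctx))

  def main(args: Array[String]): Unit = {
    // The plain summary is charged for the shared context it merely references.
    println(s"plain: ${ToyEstimator.estimate(plain)} bytes")
    // The sized summary is charged only for the metrics it owns.
    println(s"sized: ${ToyEstimator.estimate(sized)} bytes")
  }
}
```

In this sketch the plain summary's estimate is dominated by the shared payload, while the self-reporting summary stays small. That is the same trade-off the diff comment describes: a self-reported size avoids the overcount, but it is only as accurate as the author's own accounting.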



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]