Michael Kincaid created SPARK-57521:
---------------------------------------

             Summary: SizeEstimator overcounts model size due to model.parent 
traversing to SparkSession
                 Key: SPARK-57521
                 URL: https://issues.apache.org/jira/browse/SPARK-57521
             Project: Spark
          Issue Type: Bug
          Components: Connect, ML
    Affects Versions: 4.0.3, 4.1.2, 4.0.0
         Environment: Reproduced locally with:
 - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
 - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
 - Java: OpenJDK 17.0.19

Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark 
versions 4.0.x).
            Reporter: Michael Kincaid
         Attachments: repro_ml_cache_size_bug.py

In Spark Connect's server-side ML model cache, `Model.estimatedSize` calls 
`SizeEstimator.estimate(self)`, which traverses the object graph, including the 
model's `@transient parent` field. For any estimator that executes DataFrame 
operations during `fit()`, the parent retains an indirect reference chain to 
the SparkSession/SparkContext, shared JVM state that is generally much larger 
than the model. The size estimate therefore includes the size of the 
SparkSession. This is an overcount: the SparkSession exists regardless and is 
not attributable to adding this model to the ML cache. The problem compounds 
when several instances of the model are trained, because the overcount stacks 
and fills the ML cache limit with phantom usage.
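
The overcount can be illustrated with a toy reachability-based estimator in 
plain Python. This is only a sketch, not Spark's `SizeEstimator`: the `Node` 
class, the `reachable_size` helper, and every byte count are hypothetical, 
chosen to mirror the orders of magnitude described in this report.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so nodes fit in a set
class Node:
    """A toy heap object: a shallow size plus outgoing references."""
    name: str
    shallow_bytes: int
    refs: list["Node"] = field(default_factory=list)


def reachable_size(root: Node, skip: frozenset = frozenset()) -> int:
    """Sum the shallow sizes of all objects reachable from root.

    Each object is counted once per call; objects in `skip` are neither
    counted nor traversed.
    """
    seen: set[Node] = set()
    stack = [root]
    total = 0
    while stack:
        node = stack.pop()
        if node in seen or node in skip:
            continue
        seen.add(node)
        total += node.shallow_bytes
        stack.extend(node.refs)
    return total


# Hypothetical sizes: one shared session, three small fitted models.
session = Node("SparkSession", 300_000)


def make_model() -> Node:
    parent = Node("estimator (model.parent)", 200, [session])
    data = Node("model data", 2_000)
    return Node("model", 100, [data, parent])


models = [make_model() for _ in range(3)]

# Naive: follow every reference, so each estimate re-counts the session.
naive = [reachable_size(m) for m in models]
# Attributable: stop at the shared session.
attributable = [reachable_size(m, skip=frozenset({session})) for m in models]

print(sum(naive), sum(attributable))
```

In this toy setup each naive estimate is dominated by the shared session, and 
summing the estimates across models counts that session once per model even 
though only one copy exists.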

In a simple local environment the SparkSession might be ~300 KB (still much 
larger than the model, but not practically important). In complex applications 
such as the Databricks runtime, the SparkSession might be 300-800 MB, enough 
that in some configurations model training fails immediately (the Databricks 
serverless maximum model size is 256 MB).
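
As a rough worked example of why training can fail immediately, the arithmetic 
below compares an inflated estimate against a per-model limit. The 
`exceeds_limit` helper and the 4 KiB model size are assumptions for 
illustration, sizes are treated as binary units, and the real server-side 
check may differ.

```python
MIB = 1024 * 1024

# Figures taken from this report, interpreted as MiB (an assumption).
MAX_MODEL_BYTES = 256 * MIB      # per-model limit
SESSION_BYTES = 300 * MIB        # low end of the 300-800 MB range
TRUE_MODEL_BYTES = 4 * 1024      # hypothetical "few KB" of real model data


def exceeds_limit(estimated_bytes: int, limit: int = MAX_MODEL_BYTES) -> bool:
    """Hypothetical check: would a model with this estimate be rejected?"""
    return estimated_bytes > limit


inflated_estimate = TRUE_MODEL_BYTES + SESSION_BYTES

print(exceeds_limit(TRUE_MODEL_BYTES))   # judged on actual model data only
print(exceeds_limit(inflated_estimate))  # judged with the session included
```

Under these assumptions the model data alone is far below the limit, while the 
inflated estimate exceeds it even at the low end of the reported session size.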

A minimal reproduction script is attached. It shows that on a trivial 
DataFrame, a StringIndexer model (which needs only a few KB for its actual 
model data) is estimated at 10-100x that size.

Other estimators that perform DataFrame operations during `fit()` are also 
affected, e.g. CountVectorizer, StandardScaler, MinMaxScaler, IDF, and 
Word2Vec. Estimators that do not perform DataFrame operations, e.g. 
OneHotEncoder, do not seem to be affected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
