[ 
https://issues.apache.org/jira/browse/SPARK-57521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kincaid updated SPARK-57521:
------------------------------------
    Attachment: repro_ml_cache_size_bug.py

> SizeEstimator overcounts model size due to model.parent traversing to 
> SparkSession
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-57521
>                 URL: https://issues.apache.org/jira/browse/SPARK-57521
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, ML
>    Affects Versions: 4.0.0, 4.1.2, 4.0.3
>         Environment: Reproduced locally with:
>  - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
>  - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
>  - Java: OpenJDK 17.0.19
> Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark 
> versions 4.0.x).
>            Reporter: Michael Kincaid
>            Priority: Major
>         Attachments: repro_ml_cache_size_bug.py
>
>
> In Spark Connect's server-side ML model cache, `Model.estimatedSize` calls 
> `SizeEstimator.estimate(self)`, which traverses the object graph through the 
> model's `@transient parent` field. For any estimator that executes DataFrame 
> operations during `fit()`, the parent retains an indirect reference chain to 
> the SparkSession/SparkContext (shared JVM state that is generally much 
> larger than the model). The size estimate therefore includes the size of 
> the SparkSession. This is an overcount: the SparkSession exists regardless 
> and is not attributable to adding this model to the ML cache. 
> The issue becomes more severe when several models are trained, because each 
> estimate includes the overcount again and the ML cache limit fills with 
> phantom usage.
> In a simple local environment the SparkSession might be ~300 KB (still much 
> larger than the model, but not practically important). In complex 
> applications such as the Databricks runtime, the SparkSession might be 
> 300-800 MB, enough that in some configurations model training fails 
> immediately (the Databricks serverless maximum model size is 256 MB).
> A minimal reproduction script is attached. It shows that, on a trivial 
> DataFrame, a StringIndexer (which needs only a few KB for its true model 
> data) is estimated at 10-100x that size.
> Other estimators that perform DataFrame operations during `fit()` are also 
> affected, e.g. CountVectorizer, StandardScaler, MinMaxScaler, IDF, Word2Vec. 
> Those that do not perform DataFrame operations, e.g. OneHotEncoder, do not 
> seem to be affected.
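
The overcount described above can be illustrated without Spark. The sketch below is a toy model only: `deep_size` is a hypothetical stand-in for a reachability-based estimator (it is not Spark's `SizeEstimator`), and `FakeSession`, `FakeEstimator`, `FakeModel`, and `CACHE_LIMIT` are invented names and values for illustration. It assumes the shared state is reachable from the model through its parent, as the report describes, and shows how the per-model estimate and the summed cache usage change when that shared object is excluded from the walk. Excluding the shared object is one possible way to reason about the problem, not necessarily the fix Spark would adopt.

```python
import sys


def deep_size(obj, seen=None, exclude_ids=frozenset()):
    """Rough reachable size: sum sys.getsizeof over the object graph.

    Objects whose id() is in exclude_ids are skipped, together with anything
    that is reachable only through them.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen or id(obj) in exclude_ids:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_size(key, seen, exclude_ids)
            size += deep_size(value, seen, exclude_ids)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_size(item, seen, exclude_ids)
    elif hasattr(obj, "__dict__"):
        size += deep_size(vars(obj), seen, exclude_ids)
    return size


class FakeSession:
    """Hypothetical stand-in for large shared state (the SparkSession)."""

    def __init__(self):
        self.state = bytes(300_000)  # ~300 KB, the local figure from the report


class FakeEstimator:
    """Hypothetical estimator that keeps a reference to the session after fit()."""

    def __init__(self, session):
        self.session = session


class FakeModel:
    """Hypothetical model: small own data plus a parent reference."""

    def __init__(self, parent, labels):
        self.parent = parent
        self.labels = labels


session = FakeSession()
models = [FakeModel(FakeEstimator(session), ["a", "b", "c"]) for _ in range(5)]
shared = frozenset({id(session)})

# Each model is estimated independently, so the session is counted once per model.
naive_total = sum(deep_size(m) for m in models)
own_total = sum(deep_size(m, exclude_ids=shared) for m in models)

CACHE_LIMIT = 1_000_000  # hypothetical cache limit in bytes, for illustration
print("naive total:", naive_total, "exceeds limit:", naive_total > CACHE_LIMIT)
print("own total:  ", own_total, "exceeds limit:", own_total > CACHE_LIMIT)
```

Because every estimate starts a fresh traversal, the shared session is charged to each model again. That is the stacking effect from the report: the naive total crosses the illustrative limit even though the models' own data stays small.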



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
