[ 
https://issues.apache.org/jira/browse/SPARK-57521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089817#comment-18089817
 ] 

Anupam Yadav commented on SPARK-57521:
--------------------------------------

I would like to take this one. The inflated estimate comes from 
Model.estimatedSize calling SizeEstimator.estimate(self), which walks the 
transient parent reference into the shared SparkSession/SparkContext (much 
larger than the model itself). I plan to keep that shared state out of the 
model size estimate. Will open a PR shortly.
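
For readers unfamiliar with why a reachability-based estimate overcounts, here is a minimal Python sketch of the effect. It is a toy analogy, not Spark's SizeEstimator: `deep_size`, `ToySession`, `ToyEstimator`, and `ToyModel` are hypothetical names invented for illustration, and the 300,000-byte buffer merely stands in for shared session state.

```python
import sys

def deep_size(root, skip_ids=frozenset()):
    """Sum sys.getsizeof over every object reachable from root, each counted once.

    Objects whose id() is in skip_ids are treated as already seen, so neither
    they nor anything reachable only through them is counted.
    """
    seen = set(skip_ids)
    stack = [root]
    total = 0
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        total += sys.getsizeof(obj)
        if isinstance(obj, dict):
            stack.extend(obj.keys())
            stack.extend(obj.values())
        elif isinstance(obj, (list, tuple, set, frozenset)):
            stack.extend(obj)
        elif hasattr(obj, "__dict__"):
            stack.append(vars(obj))
    return total

class ToySession:                      # hypothetical stand-in for shared state
    def __init__(self):
        self.state = bytearray(300_000)

class ToyEstimator:                    # hypothetical: keeps a reference to the session
    def __init__(self, session):
        self.session = session

class ToyModel:                        # hypothetical: small payload plus a parent link
    def __init__(self, parent, labels):
        self.parent = parent
        self.labels = labels

session = ToySession()
model = ToyModel(ToyEstimator(session), ["a", "b", "c"])

naive = deep_size(model)                               # walks parent -> session
bounded = deep_size(model, skip_ids={id(session)})     # shared state excluded
print(f"naive={naive} bytes, bounded={bounded} bytes")
```

The naive walk charges the model for the entire shared buffer, while the bounded walk reports only what the model itself adds. How the actual fix excludes the shared state in Spark is up to the PR; this sketch only illustrates the principle.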

> SizeEstimator overcounts model size due to model.parent traversing to 
> SparkSession
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-57521
>                 URL: https://issues.apache.org/jira/browse/SPARK-57521
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, ML
>    Affects Versions: 4.0.0, 4.1.2, 4.0.3
>         Environment: Reproduced locally with:
>  - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
>  - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
>  - Java: OpenJDK 17.0.19
> Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark 
> versions 4.0.x).
>            Reporter: Michael Kincaid
>            Priority: Major
>         Attachments: repro_ml_cache_size_bug.py
>
>
> In Spark Connect's server-side ML model cache, {{Model.estimatedSize}} calls 
> {{SizeEstimator.estimate(self)}}, which traverses the object graph through the 
> model's {{@transient parent}} field. For any estimator that executes 
> DataFrame operations during {{fit()}}, the parent retains an indirect 
> reference chain to the {{SparkSession}}/{{SparkContext}} (shared JVM 
> state that is generally much larger than the model). The size estimate 
> therefore includes the size of the {{SparkSession}}. This is an overcount: 
> the {{SparkSession}} exists regardless and is not attributable to adding 
> this model to the ML cache. The issue becomes more severe when several 
> affected models are trained, since the overcounts stack and phantom-fill 
> the ML cache limit.
> In a simple local environment the SparkSession might be ~300 KB (still much 
> larger than the model, but not practically important). In complex applications 
> like the Databricks runtime, the SparkSession might be 300-800 MB, enough that 
> in some configurations model training might fail immediately (the Databricks 
> serverless max model size is 256 MB).
> Minimal code to reproduce is attached. It shows that on a trivial 
> DataFrame, a {{StringIndexer}} (which needs only a few KB for true model 
> data) will be estimated at 10-100x that size.
> Other estimators that perform {{DataFrame}} operations during {{fit()}} are 
> affected, e.g. {{CountVectorizer}}, {{StandardScaler}}, {{MinMaxScaler}}, 
> {{IDF}}, {{Word2Vec}}. Those that do not perform DataFrame operations, 
> e.g. {{OneHotEncoder}}, do not seem to be affected.
> SPARK-52229 is similar in spirit and was fixed in a similar way (remove the 
> reference that counts the `sparkSession`). SPARK-14966 is older and has some 
> similarities.
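
The "stacking" effect described in the issue can be sketched with simple arithmetic. The snippet below is only an illustration under loudly stated assumptions: it assumes a cache that stops admitting models once the sum of their size estimates would exceed a byte limit (the real cache's eviction or rejection policy is not described in this message), and the 1 GB limit, the 4 KB true model size, and the function name `models_admitted` are all hypothetical. The 300 MB overcount is taken from the low end of the 300-800 MB range quoted above.

```python
MB = 1024 * 1024
KB = 1024

def models_admitted(limit_bytes, true_size, overcount, n_models):
    """Count how many of n_models fit before the summed estimates pass the limit.

    Hypothetical accounting model: each estimate is the true size plus a fixed
    overcount, and a model is rejected as soon as it would exceed the limit.
    """
    used = 0
    admitted = 0
    for _ in range(n_models):
        estimate = true_size + overcount
        if used + estimate > limit_bytes:
            break
        used += estimate
        admitted += 1
    return admitted

limit = 1024 * MB      # hypothetical cache limit
true_size = 4 * KB     # "a few KB" of real model data

print(models_admitted(limit, true_size, overcount=300 * MB, n_models=100))  # 3
print(models_admitted(limit, true_size, overcount=0, n_models=100))         # 100
```

Under these assumptions the cache appears full after three tiny models, even though their real footprint is about 12 KB in total.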



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
