[ 
https://issues.apache.org/jira/browse/SPARK-57521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kincaid updated SPARK-57521:
------------------------------------
    Description: 
In Spark Connect's server-side ML model cache, {{Model.estimatedSize}} calls 
{{SizeEstimator.estimate(self)}}, which traverses the object graph through the 
model's {{@transient parent}} field. For any estimator that executes DataFrame 
operations during {{fit()}}, the parent retains an indirect reference chain 
to the {{SparkSession}}/{{SparkContext}} (shared JVM state that is generally 
much larger than the model). The size estimate therefore includes the size of 
the {{SparkSession}}, which is an overcount: the {{SparkSession}} exists 
regardless and is not attributable to adding this model to the ML cache. The 
problem compounds when several affected models are trained, because each 
overcount is added to the cache total and fills the ML cache limit with 
phantom usage.
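
The traversal effect described above can be illustrated with a toy, pure-Python sketch. It is not Spark's {{SizeEstimator}}; {{deep_size}}, {{FakeSession}}, {{FakeEstimator}} and {{FakeModel}} are hypothetical names made up for this illustration, and the sizes are those of Python objects, not JVM objects.

```python
import sys

def deep_size(obj, seen=None):
    """Toy reachability-based estimator: sum sys.getsizeof over every
    object reachable from obj, counting each object once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_size(key, seen) + deep_size(value, seen)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            size += deep_size(item, seen)
    elif hasattr(obj, "__dict__"):
        size += deep_size(vars(obj), seen)
    return size

class FakeSession:
    """Hypothetical stand-in for large shared state (about 256 KiB here)."""
    def __init__(self):
        self.state = [bytearray(1024) for _ in range(256)]

class FakeEstimator:
    """Hypothetical estimator that kept a path to the session after fit()."""
    def __init__(self, session):
        self.session = session

class FakeModel:
    def __init__(self, labels, parent):
        self.labels = labels   # the true model data
        self.parent = parent   # back-reference to the estimator

session = FakeSession()
labels = ["a", "b", "c"]
with_parent = deep_size(FakeModel(labels, FakeEstimator(session)))
without_parent = deep_size(FakeModel(labels, None))
overcount = with_parent - without_parent
print(with_parent, without_parent, overcount)
```

In this toy setup the estimate for the model with a parent is dominated by the shared session, while the same model with the parent link cleared is only as large as its own data.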

In a simple local environment the SparkSession might be ~300 KB (still much 
larger than the model, but not practically important). In complex environments 
such as the Databricks runtime, the SparkSession might be 300-800 MB, enough 
that in some configurations model training can fail immediately (the Databricks 
serverless maximum model size is 256 MB).
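
As a rough arithmetic sketch of the figures above: the 4 KB true model size and the count of three models are assumptions chosen for illustration, the 300 MB session size is the low end of the quoted range, and {{estimated_sizes}} is a hypothetical helper, not a Spark API.

```python
KB = 1024
MB = 1024 * KB

def estimated_sizes(true_model_bytes, session_bytes, n_models):
    """Per-model estimate and running total when every model's
    estimate also includes the shared session."""
    per_model = true_model_bytes + session_bytes
    return per_model, per_model * n_models

max_model_size = 256 * MB  # limit quoted in the report

# Assumed values: 4 KB of real model data, 300 MB session, 3 models.
per_model, total = estimated_sizes(4 * KB, 300 * MB, 3)
exceeds_limit = per_model > max_model_size

# Local case from the report: ~300 KB session, a single model.
local_per_model, _ = estimated_sizes(4 * KB, 300 * KB, 1)

print(per_model // MB, total // MB, exceeds_limit, local_per_model // KB)
```

Under these assumptions a single model already exceeds the 256 MB limit, and the running total grows by a full session size per model even though the session exists only once.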

A minimal reproduction is attached. It shows that on a trivial DataFrame, a 
{{StringIndexer}} (whose true model data needs only a few KB) is estimated at 
10-100x that size.

Other estimators that perform {{DataFrame}} operations during {{fit()}} are 
affected, e.g. {{CountVectorizer}}, {{StandardScaler}}, {{MinMaxScaler}}, 
{{IDF}}, {{Word2Vec}}. Those that do not perform DataFrame operations, e.g. 
{{OneHotEncoder}}, do not seem to be affected.

SPARK-52229 is similar in spirit and was fixed in a similar way (by removing 
the reference that causes the {{sparkSession}} to be counted). SPARK-14966 is 
older and has some similarities.
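
The idea of leaving a shared reference out of the count can be sketched with a toy walker that accepts a list of objects to skip. This is a pure-Python illustration only, not the SPARK-52229 patch and not Spark's {{SizeEstimator}}; {{reachable_size}}, {{Shared}} and {{Holder}} are hypothetical names.

```python
import sys

def reachable_size(root, exclude=()):
    """Sum sys.getsizeof over objects reachable from root through
    attributes and list/tuple/dict contents, skipping `exclude`."""
    seen = {id(x) for x in exclude}  # pre-mark excluded objects as visited
    stack, total = [root], 0
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        total += sys.getsizeof(obj)
        if isinstance(obj, dict):
            stack.extend(obj.keys())
            stack.extend(obj.values())
        elif isinstance(obj, (list, tuple)):
            stack.extend(obj)
        elif hasattr(obj, "__dict__"):
            stack.append(vars(obj))
    return total

class Shared:
    """Hypothetical stand-in for state that exists regardless of the model."""
    def __init__(self):
        self.payload = bytearray(200_000)

class Holder:
    """Hypothetical stand-in for a model that can reach the shared state."""
    def __init__(self, shared):
        self.labels = ["a", "b", "c"]
        self.shared = shared

shared = Shared()
holder = Holder(shared)
full = reachable_size(holder)
trimmed = reachable_size(holder, exclude=[shared])
print(full, trimmed)
```

Excluding the shared object stops the walk at that reference, so the trimmed figure reflects only what the holder itself adds.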

  was:
In Spark Connect's server-side ML model cache, `Model.estimatedSize` uses 
`SizeEstimator.estimate(self)` which traverses the object graph through the 
model's `@transient parent` field. For any estimator that executes DataFrame 
operations during `fit()`, the parent retains an indirect reference chain to 
the SparkSession/SparkContext (shared JVM state which is generally much larger 
than the model). This causes the size estimate to include the size of the 
SparkSession, which is an overcount since the SparkSession is there anyway and 
is not attributable to the addition of this model to the ML cache. The issue 
can become more severe if several instances of the model are trained since the 
overcount will stack and cause phantom filling of the ML cache limit.

In a simple local environment the SparkSession might be ~300kb (still much 
larger than the model but not practically important). In complex applications 
like the Databricks runtime, the SparkSession might be 300-800MB, enough that 
in some configurations the model training might immediately fail (Databricks 
serverless max model size is 256M).

Example minimal code to reproduce is attached, showing that on a trivial 
DataFrame, a StringIndexer (which needs only a few kb for true model data) will 
be estimated at 10-100x that size.

Other estimators that perform DataFrame operations during fit() are affected, 
e.g. CountVectorizer, StandardScaler, MinMaxScaler, IDF, Word2Vec. Those that 
do not perform DataFrame operations, e.g. OneHotEncoder, do not seem to be 
affected.


> SizeEstimator overcounts model size due to model.parent traversing to 
> SparkSession
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-57521
>                 URL: https://issues.apache.org/jira/browse/SPARK-57521
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, ML
>    Affects Versions: 4.0.0, 4.1.2, 4.0.3
>         Environment: Reproduced locally with:
>  - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
>  - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
>  - Java: OpenJDK 17.0.19
> Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark 
> versions 4.0.x).
>            Reporter: Michael Kincaid
>            Priority: Major
>         Attachments: repro_ml_cache_size_bug.py


