[
https://issues.apache.org/jira/browse/SPARK-57521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57521:
-----------------------------------
Labels: pull-request-available (was: )
> SizeEstimator overcounts model size due to model.parent traversing to
> SparkSession
> ----------------------------------------------------------------------------------
>
> Key: SPARK-57521
> URL: https://issues.apache.org/jira/browse/SPARK-57521
> Project: Spark
> Issue Type: Bug
> Components: Connect, ML
> Affects Versions: 4.0.0, 4.1.2, 4.0.3
> Environment: Reproduced locally with:
> - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
> - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
> - Java: OpenJDK 17.0.19
> Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark
> versions 4.0.x).
> Reporter: Michael Kincaid
> Priority: Major
> Labels: pull-request-available
> Attachments: repro_ml_cache_size_bug.py
>
>
> In Spark Connect's server-side ML model cache, {{Model.estimatedSize}} calls
> {{SizeEstimator.estimate(self)}}, which traverses the object graph through the
> model's {{@transient parent}} field. For any estimator that executes
> DataFrame operations during {{fit()}}, the parent retains an indirect
> reference chain to the {{SparkSession}}/{{SparkContext}} (shared JVM
> state that is generally much larger than the model). The size estimate
> therefore includes the size of the {{SparkSession}}. This is an overcount:
> the {{SparkSession}} exists regardless and is not attributable to adding
> this model to the ML cache. The issue becomes more severe when several
> affected models are trained, since the overcounts stack and cause phantom
> filling of the ML cache limit.
> In a simple local environment the {{SparkSession}} might be ~300 KB (still
> much larger than the model, but not important in practice). In complex
> applications such as the Databricks runtime, the {{SparkSession}} might be
> 300-800 MB, enough that in some configurations model training fails
> immediately (the Databricks serverless max model size is 256 MB).
> Minimal code to reproduce the issue is attached. It shows that on a trivial
> DataFrame, a {{StringIndexer}} (which needs only a few KB for its true model
> data) is estimated at 10-100x that size.
> Other estimators that perform {{DataFrame}} operations during {{fit()}} are
> affected, e.g. {{CountVectorizer}}, {{StandardScaler}}, {{MinMaxScaler}},
> {{IDF}}, {{Word2Vec}}. Those that do not perform {{DataFrame}} operations,
> e.g. {{OneHotEncoder}}, do not seem to be affected.
> SPARK-52229 is similar in spirit and was fixed in a similar way (by removing
> the reference that causes the {{sparkSession}} to be counted). SPARK-14966 is
> older and has some similarities.
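
The overcount described above can be illustrated with a toy Python sketch. This is not Spark's {{SizeEstimator}} (which runs on the JVM) and not the attached repro script. The names {{deep_size}}, {{FakeSession}}, {{FakeEstimator}} and {{FakeModel}} are hypothetical stand-ins, and the 300,000-byte session payload is an assumption chosen to echo the ~300 KB figure in the report. The sketch only shows why a reachability-based size estimate that follows a parent reference charges the shared session to every model, and how skipping that reference removes the overcount.

```python
import sys


def deep_size(obj, skip_attrs=frozenset(), _seen=None):
    """Toy reachable-size estimate: sums sys.getsizeof over the object graph.

    Attributes named in skip_attrs are not followed (a stand-in for
    excluding the parent reference from the estimate).
    """
    seen = set() if _seen is None else _seen
    if id(obj) in seen:
        return 0  # count each object once per estimate
    seen.add(id(obj))
    total = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            total += deep_size(key, skip_attrs, seen)
            total += deep_size(value, skip_attrs, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            total += deep_size(item, skip_attrs, seen)
    elif hasattr(obj, "__dict__"):
        for name, value in vars(obj).items():
            if name not in skip_attrs:
                total += deep_size(value, skip_attrs, seen)
    return total


class FakeSession:
    """Hypothetical stand-in for shared session state."""

    def __init__(self, state_bytes):
        self.shared_state = bytearray(state_bytes)


class FakeEstimator:
    """Hypothetical estimator that keeps a reference chain to the session."""

    def __init__(self, session):
        self.session_ref = session


class FakeModel:
    """Hypothetical model: small payload plus a reference to its parent."""

    def __init__(self, labels, parent):
        self.labels = labels
        self.parent = parent


# One shared session, three models fitted against it (assumed sizes).
session = FakeSession(300_000)
models = [FakeModel(["a", "b", "c"], FakeEstimator(session)) for _ in range(3)]

# Following `parent` charges the whole session to every model.
naive = [deep_size(m) for m in models]
# Skipping `parent` leaves only the model's own data.
fixed = [deep_size(m, skip_attrs={"parent"}) for m in models]

print("per-model estimate following parent:", naive)
print("per-model estimate skipping parent: ", fixed)
print("cache usage charged (following parent):", sum(naive))
```

In this sketch each per-model estimate that follows {{parent}} exceeds the session payload, and the sum over three models exceeds three times that payload even though only one session object exists. That is the stacking effect the report describes for the ML cache limit.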
--
This message was sent by Atlassian Jira
(v8.20.10#820010)