[
https://issues.apache.org/jira/browse/SPARK-57521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Kincaid updated SPARK-57521:
------------------------------------
Attachment: repro_ml_cache_size_bug.py
> SizeEstimator overcounts model size due to model.parent traversing to
> SparkSession
> ----------------------------------------------------------------------------------
>
> Key: SPARK-57521
> URL: https://issues.apache.org/jira/browse/SPARK-57521
> Project: Spark
> Issue Type: Bug
> Components: Connect, ML
> Affects Versions: 4.0.0, 4.1.2, 4.0.3
> Environment: Reproduced locally with:
> - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
> - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
> - Java: OpenJDK 17.0.19
> Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark
> versions 4.0.x).
> Reporter: Michael Kincaid
> Priority: Major
> Attachments: repro_ml_cache_size_bug.py
>
>
> In Spark Connect's server-side ML model cache, `Model.estimatedSize` uses
> `SizeEstimator.estimate(self)`, which traverses the object graph through the
> model's `@transient parent` field. For any estimator that executes DataFrame
> operations during `fit()`, the parent retains an indirect reference chain to
> the SparkSession/SparkContext (shared JVM state that is generally much
> larger than the model). As a result, the size estimate includes the size of
> the SparkSession. This is an overcount: the SparkSession exists regardless
> and is not attributable to adding this model to the ML cache.
> The issue becomes more severe when several models are trained, because each
> estimate repeats the overcount and the ML cache limit is reached by phantom
> usage rather than by real model data.
> In a simple local environment the SparkSession might be ~300 KB (still much
> larger than the model, but not practically important). In complex
> applications such as the Databricks runtime, the SparkSession might be
> 300-800 MB, enough that in some configurations model training fails
> immediately (the Databricks serverless maximum model size is 256 MB).
> Minimal code to reproduce is attached. It shows that, on a trivial
> DataFrame, a StringIndexer (which needs only a few KB for its true model
> data) is estimated at 10-100x that size.
> Other estimators that perform DataFrame operations during `fit()` are also
> affected, e.g. CountVectorizer, StandardScaler, MinMaxScaler, IDF, Word2Vec.
> Those that do not perform DataFrame operations, e.g. OneHotEncoder, do not
> seem to be affected.
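
The overcount described above can be illustrated with a toy model of
reachability-based size estimation. This is only a sketch, not Spark's
`SizeEstimator` and not the attached repro: the `Node` class, the
`reachable_size` helper, and all byte figures are hypothetical and chosen
only to mirror the rough magnitudes in the report (a ~300 KB session, a few
KB of model data).

```python
class Node:
    """Toy stand-in for a JVM object: a shallow size plus outgoing references."""

    def __init__(self, name, shallow_bytes, refs=()):
        self.name = name
        self.shallow_bytes = shallow_bytes
        self.refs = list(refs)


def reachable_size(root, exclude=()):
    """Sum the shallow sizes of all objects reachable from `root`.

    Each object is counted once per call. Objects listed in `exclude` are
    treated as already accounted for, so they and anything reachable only
    through them are skipped.
    """
    seen = {id(node) for node in exclude}
    stack = [root]
    total = 0
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        total += node.shallow_bytes
        stack.extend(node.refs)
    return total


def build_model(session):
    # Hypothetical sizes: 4 KB of real model data, a small parent estimator
    # that still holds a reference chain back to the shared session.
    parent = Node("estimator", 500, refs=[session])
    return Node("model", 4_000, refs=[parent])


# One shared session (about 300 KB, as in the local case) and five models.
session = Node("session", 300_000)
models = [build_model(session) for _ in range(5)]

# Per-model estimates that follow the parent reference into the session.
naive = [reachable_size(m) for m in models]
# Per-model estimates that treat the shared session as not attributable.
attributable = [reachable_size(m, exclude=[session]) for m in models]

print(naive[0])           # 304500
print(attributable[0])    # 4500
print(sum(naive))         # 1522500
print(sum(attributable))  # 22500
```

In this toy setup each model is charged for the whole session, so five small
models appear to consume about 1.5 MB of cache while their attributable data
is about 22 KB. With a session in the hundreds of MB, the same arithmetic
would exceed a 256 MB limit on the first model.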