Michael Kincaid created SPARK-57521:
---------------------------------------
Summary: SizeEstimator overcounts model size due to model.parent
traversing to SparkSession
Key: SPARK-57521
URL: https://issues.apache.org/jira/browse/SPARK-57521
Project: Spark
Issue Type: Bug
Components: Connect, ML
Affects Versions: 4.0.3, 4.1.2, 4.0.0
Environment: Reproduced locally with:
- Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
- Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
- Java: OpenJDK 17.0.19
Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark
versions 4.0.x).
Reporter: Michael Kincaid
Attachments: repro_ml_cache_size_bug.py
In Spark Connect's server-side ML model cache, `Model.estimatedSize` calls
`SizeEstimator.estimate(self)`, which traverses the object graph, including the
model's `@transient parent` field. For any estimator that executes DataFrame
operations during `fit()`, the parent retains an indirect reference chain to
the SparkSession/SparkContext (shared JVM state that is generally much larger
than the model). As a result, the size estimate includes the size of the
SparkSession. This is an overcount: the SparkSession exists regardless and is
not attributable to adding this model to the ML cache. The issue becomes more
severe when several instances of the model are trained, because the overcount
stacks with each model and the ML cache appears to reach its limit long before
it is actually full.
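
To illustrate the mechanism outside of Spark, here is a minimal pure-Python sketch of a reachability-based size estimator. It is not Spark's `SizeEstimator`; the `Session`, `Estimator`, and `Model` classes and the `reachable_size` helper are hypothetical stand-ins, and the 1 MB payload is an arbitrary assumption. It only shows how a `parent` reference lets shared state be counted once per model.

```python
import sys


def reachable_size(root, skip=()):
    """Toy estimator: sum sys.getsizeof over every object reachable from root.

    Objects listed in `skip` (and everything only reachable through them)
    are left out of the total.
    """
    skip_ids = {id(o) for o in skip}
    seen, stack, total = set(), [root], 0
    while stack:
        obj = stack.pop()
        if id(obj) in seen or id(obj) in skip_ids:
            continue
        seen.add(id(obj))
        total += sys.getsizeof(obj)
        if isinstance(obj, dict):
            stack.extend(obj.keys())
            stack.extend(obj.values())
        elif isinstance(obj, (list, tuple, set, frozenset)):
            stack.extend(obj)
        elif hasattr(obj, "__dict__"):
            stack.append(vars(obj))
    return total


class Session:
    """Hypothetical stand-in for shared SparkSession/SparkContext state."""

    def __init__(self):
        self.state = bytearray(1_000_000)  # assumed size, for illustration


class Estimator:
    """Hypothetical estimator that keeps a reference to the session."""

    def __init__(self, session):
        self.session = session


class Model:
    """Hypothetical fitted model: small payload plus a parent reference."""

    def __init__(self, parent):
        self.labels = ["a", "b", "c"]
        self.parent = parent


session = Session()
models = [Model(Estimator(session)) for _ in range(5)]

# Per-model estimates that follow `parent` each include the shared session.
naive = sum(reachable_size(m) for m in models)
# Skipping `parent` counts only the data the model itself owns.
own = sum(reachable_size(m, skip=[m.parent]) for m in models)

print(f"following parent: {naive} bytes, skipping parent: {own} bytes")
```

In this sketch the shared buffer is counted five times when `parent` is followed, even though only one copy exists in memory.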
In a simple local environment the SparkSession might be ~300 KB (still much
larger than the model, but not practically important). In complex applications
like the Databricks runtime, the SparkSession might be 300-800 MB, enough that
in some configurations model training fails immediately (the Databricks
serverless maximum model size is 256 MB).
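
The arithmetic behind that failure can be sketched as follows. The 256 MB limit and the 300-800 MB session sizes are the figures quoted above; the 4 KB true model size and the helper names are assumptions made for illustration only.

```python
MB = 1024 * 1024

MAX_MODEL_SIZE = 256 * MB   # limit quoted in this report
TRUE_MODEL_SIZE = 4 * 1024  # assumed: "a few KB" of real model data


def estimated_size(true_size, session_size):
    """Estimate as seen by a traversal that also reaches the session."""
    return true_size + session_size


def fits(true_size, session_size, limit=MAX_MODEL_SIZE):
    """Whether the estimate stays within the per-model size limit."""
    return estimated_size(true_size, session_size) <= limit


# Session sizes in MB: none reachable, and the low/high ends of the range above.
results = {mb: fits(TRUE_MODEL_SIZE, mb * MB) for mb in (0, 300, 800)}
print(results)
```

Under these assumptions a model with a few KB of real data passes the check only when the session is not reachable from it.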
A minimal reproduction is attached. It shows that, on a trivial DataFrame, a
StringIndexer (which needs only a few KB for its true model data) is estimated
at 10-100x that size.
Other estimators that perform DataFrame operations during fit() are affected,
e.g. CountVectorizer, StandardScaler, MinMaxScaler, IDF, Word2Vec. Those that
do not perform DataFrame operations, e.g. OneHotEncoder, do not seem to be
affected.