[ https://issues.apache.org/jira/browse/SPARK-57521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Kincaid updated SPARK-57521:
------------------------------------
Description:
In Spark Connect's server-side ML model cache, {{Model.estimatedSize}} calls
{{SizeEstimator.estimate(self)}}, which traverses the object graph through the
model's {{@transient parent}} field. For any estimator that executes DataFrame
operations during {{fit()}}, the parent retains an indirect reference chain
to the {{SparkSession}}/{{SparkContext}} (shared JVM state that is
generally much larger than the model). The size estimate therefore includes
the size of the {{SparkSession}}. This is an overcount: the
{{SparkSession}} exists regardless and is not attributable to adding
this model to the ML cache. The issue becomes more severe when several
affected models are trained, since the overcount stacks per model and causes
phantom filling of the ML cache limit.
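The mechanism can be illustrated outside Spark with a toy model. The sketch below is not Spark's {{SizeEstimator}} and is not the attached repro. {{Session}}, {{Estimator}}, {{Model}} and {{deep_size}} are hypothetical stand-ins, assuming only that the estimator walks every reachable reference and that the model's {{parent}} reaches a large shared object.

```python
import sys


class Session:
    """Hypothetical stand-in for large shared state (e.g. a SparkSession)."""

    def __init__(self, payload_bytes):
        self.payload = bytearray(payload_bytes)


class Estimator:
    """Hypothetical estimator that keeps a reference to the session."""

    def __init__(self, session):
        self.session = session


class Model:
    """Hypothetical fitted model: small own data plus a parent reference."""

    def __init__(self, labels, parent):
        self.labels = labels
        self.parent = parent


def deep_size(obj, skip_attrs=frozenset(), _seen=None):
    """Sum sys.getsizeof over everything reachable from obj.

    Attribute names listed in skip_attrs are not followed.
    """
    seen = set() if _seen is None else _seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    total = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            total += deep_size(item, skip_attrs, seen)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            total += deep_size(key, skip_attrs, seen)
            total += deep_size(value, skip_attrs, seen)
    elif hasattr(obj, "__dict__"):
        for name, value in vars(obj).items():
            if name not in skip_attrs:
                total += deep_size(value, skip_attrs, seen)
    return total


session = Session(1_000_000)            # one shared session, ~1 MB
estimator = Estimator(session)
model_a = Model(["a", "b", "c"], parent=estimator)
model_b = Model(["x", "y"], parent=estimator)

naive_a = deep_size(model_a)                        # follows parent
own_a = deep_size(model_a, skip_attrs={"parent"})   # model data only
naive_total = deep_size(model_a) + deep_size(model_b)

print(f"naive estimate: {naive_a} bytes, own data: {own_a} bytes")
print(f"two models, estimated separately: {naive_total} bytes")
```

Each model is estimated independently, so the single shared session is counted once per model. That is the stacking effect described above. Skipping the {{parent}} attribute in this toy leaves only the model's own data.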
In a simple local environment the {{SparkSession}} might be ~300 KB (still much
larger than the model, but not practically important). In complex deployments
such as the Databricks runtime, the {{SparkSession}} might be 300-800 MB, enough
that in some configurations model training can fail immediately (the Databricks
serverless max model size is 256 MB).
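The arithmetic behind these numbers can be sketched as follows. This assumes a simplified additive model (estimated size = true model data + session size), takes "a few KB" as 4 KB, and uses a hypothetical 1 MB cache limit for the local case. None of these values are measured, and the helper names are illustrative only.

```python
KB = 1024
MB = 1024 * KB


def estimated_size(true_model_bytes, session_bytes):
    """Simplified assumption: the overcount is the full session size."""
    return true_model_bytes + session_bytes


def fits(model_bytes, max_model_bytes):
    """True if a model of this (estimated) size passes the per-model limit."""
    return model_bytes <= max_model_bytes


def models_until_full(per_model_bytes, cache_limit_bytes):
    """How many models of this estimated size fit under the cache limit."""
    return cache_limit_bytes // per_model_bytes


true_model = 4 * KB        # assumed value for "a few KB"
max_model = 256 * MB       # per-model limit quoted in the report

# Large-session case: rejected even at the low end of 300-800 MB.
rejected_low = not fits(estimated_size(true_model, 300 * MB), max_model)
rejected_high = not fits(estimated_size(true_model, 800 * MB), max_model)
accepted_true = fits(true_model, max_model)

# Local case: ~300 KB session, hypothetical 1 MB cache limit.
local_estimate = estimated_size(true_model, 300 * KB)
naive_capacity = models_until_full(local_estimate, 1 * MB)
true_capacity = models_until_full(true_model, 1 * MB)

print(rejected_low, rejected_high, accepted_true)
print(naive_capacity, true_capacity)
```

Under these assumptions a model whose real data is a few KB is rejected outright in the large-session case. In the local case the cache appears full after far fewer models than its true contents would justify.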
Minimal code to reproduce is attached. It shows that on a trivial
DataFrame, a {{StringIndexer}} (which needs only a few KB for true model data)
is estimated at 10-100x that size.
Other estimators that perform {{DataFrame}} operations during {{fit()}} are
affected, e.g. {{CountVectorizer}}, {{StandardScaler}},
{{MinMaxScaler}}, {{IDF}}, {{Word2Vec}}. Those that do not perform
{{DataFrame}} operations, e.g. {{OneHotEncoder}}, do not seem to be affected.
SPARK-52229 is similar in spirit and was fixed in a similar way (by removing the
reference that causes the {{sparkSession}} to be counted). SPARK-14966 is older
and has some similarities.
was:
In Spark Connect's server-side ML model cache, `Model.estimatedSize` uses
`SizeEstimator.estimate(self)` which traverses the object graph through the
model's `@transient parent` field. For any estimator that executes DataFrame
operations during `fit()`, the parent retains an indirect reference chain to
the SparkSession/SparkContext (shared JVM state which is generally much larger
than the model). This causes the size estimate to include the size of the
SparkSession, which is an overcount since the SparkSession is there anyway and
is not attributable to the addition of this model to the ML cache. The issue
can become more severe if several instances of the model are trained since the
overcount will stack and cause phantom filling of the ML cache limit.
In a simple local environment the SparkSession might be ~300kb (still much
larger than the model but not practically important). In complex applications
like the Databricks runtime, the SparkSession might be 300-800MB, enough that
in some configurations the model training might immediately fail (Databricks
serverless max model size is 256M).
Example minimal code to reproduce is attached, showing that on a trivial
DataFrame, a StringIndexer (which needs only a few kb for true model data) will
be estimated at 10-100x that size.
Other estimators that perform DataFrame operations during fit() are affected,
e.g. CountVectorizer, StandardScaler, MinMaxScaler, IDF, Word2Vec. Those that
do not perform DataFrame operations, e.g. OneHotEncoder, do not seem to be
affected.
> SizeEstimator overcounts model size due to model.parent traversing to
> SparkSession
> ----------------------------------------------------------------------------------
>
> Key: SPARK-57521
> URL: https://issues.apache.org/jira/browse/SPARK-57521
> Project: Spark
> Issue Type: Bug
> Components: Connect, ML
> Affects Versions: 4.0.0, 4.1.2, 4.0.3
> Environment: Reproduced locally with:
> - Spark Connect server: `spark-4.1.2-bin-hadoop3-connect`, local mode
> - Client: `pyspark[connect]==4.1.2`, Python 3.13, macOS ARM64
> - Java: OpenJDK 17.0.19
> Originally found in Databricks (serverless env 4 and 5, DBR 17.3 – Spark
> versions 4.0.x).
> Reporter: Michael Kincaid
> Priority: Major
> Attachments: repro_ml_cache_size_bug.py