[
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-21972:
--------------------------------------
Target Version/s: 2.4.0 (was: 2.3.0)
> Allow users to control input data persistence in ML Estimators via a
> handlePersistence ml.Param
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.2.0
> Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans,
> etc.) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b)
> check the persistence level of the input dataset but not any of its parents.
> These issues can result in unwanted double-caching of input data and degraded
> performance (see
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm
> should try to cache uncached input data. {{handlePersistence}} will be
> {{true}} by default, corresponding to existing behavior (always persisting
> uncached input), but users can achieve finer-grained control over input
> persistence by setting {{handlePersistence}} to {{false}}.
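
The behavior described above can be illustrated with a small toy model. This is not Spark code: ToyDataset, fit, and the handle_persistence argument are hypothetical stand-ins that only track a storage level and a parent, assuming the estimator checks the storage level of its direct input and nothing else. The sketch shows how an input derived from a cached parent still looks uncached, and how a boolean flag would let the caller opt out of the extra caching.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical storage-level labels; they only mimic the idea of Spark's levels.
NONE = "NONE"
MEMORY_AND_DISK = "MEMORY_AND_DISK"


@dataclass
class ToyDataset:
    """Stand-in for a dataset: tracks only its own storage level and its parent."""
    name: str
    storage_level: str = NONE
    parent: Optional["ToyDataset"] = None

    def persist(self, level: str = MEMORY_AND_DISK) -> "ToyDataset":
        self.storage_level = level
        return self

    def unpersist(self) -> "ToyDataset":
        self.storage_level = NONE
        return self

    def select(self, name: str) -> "ToyDataset":
        # A derived dataset starts uncached, even when its parent is cached.
        return ToyDataset(name=name, parent=self)


def fit(dataset: ToyDataset, handle_persistence: bool = True) -> dict:
    """Hypothetical estimator: caches uncached input only when the flag is set."""
    # Only the input's own storage level is inspected, not its parents'.
    cached_here = handle_persistence and dataset.storage_level == NONE
    if cached_here:
        dataset.persist()
    try:
        # Iterative training would run here and reuse the input many times.
        return {"input": dataset.name, "cached_by_estimator": cached_here}
    finally:
        # Release only what this call cached itself.
        if cached_here:
            dataset.unpersist()


raw = ToyDataset("raw").persist()      # the user caches the parent
features = raw.select("features")      # derived input reports NONE

default_run = fit(features)                             # caches again
opt_out_run = fit(features, handle_persistence=False)   # leaves caching to the user
```

With the default, the estimator caches `features` even though `raw` is already cached, which is the double-caching case the ticket describes. Passing `handle_persistence=False` leaves persistence entirely to the caller.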