[ https://issues.apache.org/jira/browse/SPARK-26172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-26172:
------------------------------------
Assignee: (was: Apache Spark)
> Unify String Params' case-insensitivity in ML
> ---------------------------------------------
>
> Key: SPARK-26172
> URL: https://issues.apache.org/jira/browse/SPARK-26172
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Priority: Major
>
> Currently, ML handles case-insensitivity of string params in three ways:
> 1, support case-insensitivity, e.g. {{LogisticRegression}};
> 2, support case-insensitivity, but with the getter returning the lower-case value
> (not the value passed to the setter), e.g. {{ALS}}, {{DecisionTreeClassifier}};
> 3, do not support case-insensitivity, e.g. {{NaiveBayes}}.
>
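
To make the three behaviours concrete, here is a minimal standalone sketch. It does not use Spark at all: the object name, the setter names and the set of allowed values are hypothetical, and each function simply returns what a getter would later report.

```scala
import java.util.Locale

object CaseHandlingSketch {
  // Hypothetical allowed values, for illustration only (not taken from Spark).
  val allowed: Set[String] = Set("auto", "binomial", "multinomial")

  // Way 1: validate case-insensitively, keep the value exactly as passed.
  def setWay1(value: String): String = {
    require(allowed.contains(value.toLowerCase(Locale.ROOT)), s"Unsupported value: $value")
    value
  }

  // Way 2: validate case-insensitively, but keep only the lower-cased value.
  def setWay2(value: String): String = {
    val lower = value.toLowerCase(Locale.ROOT)
    require(allowed.contains(lower), s"Unsupported value: $value")
    lower
  }

  // Way 3: validate case-sensitively; mixed-case input is rejected.
  def setWay3(value: String): String = {
    require(allowed.contains(value), s"Unsupported value: $value")
    value
  }
}
```

With the input "Binomial", the first function returns it unchanged, the second returns the lower-cased form, and the third throws an IllegalArgumentException.
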
> This inconsistency makes the params confusing to use.
> I think we should choose the *first* way to support case-insensitivity of all
> non-columnName string params, including:
> * LogisticRegression: family
> * MultilayerPerceptronClassifier: solver
> * NaiveBayes: modelType
> * DecisionTreeClassifier: impurity
> * RandomForestClassifier: featureSubsetStrategy, impurity
> * GBTClassifier: featureSubsetStrategy, impurity, lossType
>
> * LinearRegression: solver, loss
> * GeneralizedLinearRegression: family, link, solver
> * DecisionTreeRegressor: impurity
> * RandomForestRegressor: featureSubsetStrategy, impurity
> * GBTRegressor: featureSubsetStrategy, impurity, lossType
>
> * KMeans: initMode
> * LDA: optimizer
> * PowerIterationClustering: initMode
>
> * ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
>
> * Bucketizer: handleInvalid
> * ChiSqSelector: selectorType
> * Imputer: strategy
> * QuantileDiscretizer: handleInvalid
> * RFormula: handleInvalid, stringIndexerOrderType
> * StringIndexer: handleInvalid, stringOrderType
> * VectorAssembler: handleInvalid
> * VectorIndexer: handleInvalid
> * VectorSizeHint: handleInvalid
> * OneHotEncoderEstimator: handleInvalid (*this will be left alone until the
> breaking change*)
>
> * BinaryClassificationEvaluator: metricName
> * MulticlassClassificationEvaluator: metricName
> * RegressionEvaluator: metricName
> * ClusteringEvaluator: metricName, distanceMeasure
>
>
>
> To do this:
> * methods {{lowerCaseInArray}} and {{upperCaseInArray}} are added to
> {{ParamValidators}} to validate values case-insensitively;
> * methods {{$$(param: Param[String])}} and {{%%(param: Param[String])}}
> are added to trait {{Params}} to get the lower-case/upper-case param value conveniently.
> This minimizes the modifications to existing code, since in many
> cases we only need to change {{$(param)}} to {{$$(param)}};
> * in *SharedParamsCodeGen*, *handleInvalid* and *distanceMeasure* are
> updated to use {{lowerCaseInArray}}.
>
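
A rough sketch of how the helpers described above could fit together. This is not the code from the patch: only the method names lowerCaseInArray, upperCaseInArray, $$ and %% come from the description, while Param, ParamValidatorsSketch and ParamsSketch are simplified, hypothetical stand-ins for the Spark classes so that the example runs on its own.

```scala
import java.util.Locale
import scala.collection.mutable

// Hypothetical, minimal stand-in for Spark's Param: a name plus a validator.
final case class Param[T](name: String, isValid: T => Boolean)

object ParamValidatorsSketch {
  // Accepts a value if its lower-cased form is one of the allowed (lower-case) values.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Accepts a value if its upper-cased form is one of the allowed (upper-case) values.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))
}

// Hypothetical, minimal stand-in for the Params trait.
trait ParamsSketch {
  private val values = mutable.Map.empty[String, Any]

  // Validates, then stores the value exactly as passed by the caller.
  def set[T](param: Param[T], value: T): this.type = {
    require(param.isValid(value), s"${param.name} given invalid value $value")
    values(param.name) = value
    this
  }

  // Returns the stored value unchanged, so getters keep the user's casing.
  def $[T](param: Param[T]): T = values(param.name).asInstanceOf[T]

  // Returns the stored value lower-cased, for internal comparisons.
  def $$(param: Param[String]): String = $(param).toLowerCase(Locale.ROOT)

  // Returns the stored value upper-cased, for internal comparisons.
  def %%(param: Param[String]): String = $(param).toUpperCase(Locale.ROOT)
}
```

Under these assumptions, a param declared with lowerCaseInArray(Array("auto", "binomial")) accepts "Binomial"; $(param) still returns "Binomial", while $$(param) returns "binomial" for internal matching, which is why swapping $(param) for $$(param) is often the only change needed at the call site.
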