srowen commented on a change in pull request #20146: [SPARK-11215][ML] Add multiple columns support to StringIndexer
URL: https://github.com/apache/spark/pull/20146#discussion_r245850035
##########
File path: docs/ml-guide.md
##########
@@ -110,6 +110,16 @@ and the migration guide below will explain all changes between releases.
 * `OneHotEncoder` which is deprecated in 2.3, is removed in 3.0 and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`.
 
+### Changes of behavior
+
+* [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
+ In Spark 2.4 and previous versions, when specifying `frequencyDesc` or `frequencyAsc` as
+ `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of
+ strings is undefined. Since Spark 3.0, the strings with equal frequency are further
+ sorted by alphabet. And since Spark 3.0, `StringIndexer` supports encoding multiple
+ columns. Because of this change, `StringIndexerModel`'s public constructor `def this(uid: String, labels: Array[String])`

Review comment:
_shrug_ It seems more common to transform one column than many. You mean that the constructor is rarely used vs the setters? I agree. I could see deprecating all constructors that fit this pattern for 3.0, across other classes too, if so. It just seemed pretty easy to keep the constructor right here, esp. as it wasn't deprecated.
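For reference, the tie-breaking rule described in the doc text above can be sketched in plain Scala. This is only an illustration under stated assumptions, not Spark's implementation: `FrequencyDescSketch` and `orderLabels` are hypothetical names, and "sorted by alphabet" is assumed here to mean Scala's default `String` ordering.

```scala
// Hypothetical helper, not part of Spark: orders labels by descending
// frequency and breaks ties with the default String ordering, which is
// an assumed reading of "sorted by alphabet" in the migration note.
object FrequencyDescSketch {
  def orderLabels(values: Seq[String]): Array[String] = {
    values
      .groupBy(identity)                                     // label -> all occurrences
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }     // count descending, then label ascending
      .map(_._1)
      .toArray
  }

  def main(args: Array[String]): Unit = {
    // "a" and "b" both occur twice; "c" and "d" both occur once.
    val labels = orderLabels(Seq("b", "a", "c", "a", "b", "d"))
    println(labels.mkString(", ")) // prints "a, b, c, d"
  }
}
```

With a deterministic tie-break like this, the index assigned to each label no longer depends on the order in which equally frequent strings happen to be encountered.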
