Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19991#discussion_r157433642
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
@@ -117,15 +128,28 @@ class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transforme
   @Since("2.3.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
+  /** @group getParam */
+  @Since("2.3.0")
+  def getCategoricalCols: Array[String] = $(categoricalCols)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setCategoricalCols(value: Array[String]): this.type = set(categoricalCols, value)
+
   @Since("2.3.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val hashFunc: Any => Int = OldHashingTF.murmur3Hash
     val n = $(numFeatures)
     val localInputCols = $(inputCols)
+    val catCols = if (isSet(categoricalCols)) {
+      $(categoricalCols).toSet
+    } else {
+      Set[String]()
+    }
     val outputSchema = transformSchema(dataset.schema)
     val realFields = outputSchema.fields.filter { f =>
-      f.dataType.isInstanceOf[NumericType]
+      f.dataType.isInstanceOf[NumericType] && !catCols.contains(f.name)
--- End diff --
I'm not sure I follow your question here.

The existing behavior does not skip any fields. `realFields` is all the
numeric fields (i.e. not `String` or `Boolean`), which are then hashed by
the `hashFeatures` udf below. Categorical fields are hashed slightly
differently.

This change just restricts `realFields` to those numeric fields that
_are not_ in `categoricalCols`.
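
To make the selection rule concrete, here is a minimal sketch of the filter in plain Scala, without a Spark dependency. `ColType`, `Field`, and `RealFieldSelection` are hypothetical stand-ins invented for this illustration (they are not Spark classes); only the predicate itself mirrors the changed line in the diff.

```scala
// Hypothetical stand-ins for Spark's column types, for illustration only.
sealed trait ColType
case object NumericCol extends ColType
case object StringCol extends ColType
case object BooleanCol extends ColType

// Hypothetical stand-in for a schema field (name plus type).
final case class Field(name: String, dataType: ColType)

object RealFieldSelection {
  // Same rule as the new filter: keep a field only if it is numeric
  // AND its name is not listed in categoricalCols.
  def realFields(fields: Seq[Field], categoricalCols: Set[String]): Seq[String] =
    fields.collect {
      case Field(name, NumericCol) if !categoricalCols.contains(name) => name
    }
}
```

With an empty `categoricalCols` the result is every numeric column, which matches the existing behavior. Listing a numeric column in `categoricalCols` removes it from the result, so it would be handled on the categorical path instead.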