[GitHub] spark pull request #19991: [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to...

MLnick Mon, 18 Dec 2017 01:28:15 -0800

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19991#discussion_r157433642
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -117,15 +128,28 @@ class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transforme
       @Since("2.3.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getCategoricalCols: Array[String] = $(categoricalCols)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setCategoricalCols(value: Array[String]): this.type = set(categoricalCols, value)
    +
       @Since("2.3.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         val hashFunc: Any => Int = OldHashingTF.murmur3Hash
         val n = $(numFeatures)
         val localInputCols = $(inputCols)
    +    val catCols = if (isSet(categoricalCols)) {
    +      $(categoricalCols).toSet
    +    } else {
    +      Set[String]()
    +    }
     
         val outputSchema = transformSchema(dataset.schema)
         val realFields = outputSchema.fields.filter { f =>
    -      f.dataType.isInstanceOf[NumericType]
    +      f.dataType.isInstanceOf[NumericType] && !catCols.contains(f.name)
    --- End diff --
    
    Not sure I follow your question here. 
    
    The existing behavior is not to skip any fields. `realFields` is all the numeric fields (i.e. not `String` or `Boolean`), which are then hashed as per the `hashFeatures` udf below. Categoricals are hashed slightly differently.
    
    This change just sets `realFields` to those numeric fields that _are not_ in `categoricalCols`.
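
To make the filter concrete, here is a minimal standalone sketch of the selection logic. It assumes simplified stand-ins for the schema types (`ColType`, `Field`, and `RealFieldsSketch` are hypothetical names, not Spark classes) so that it runs without Spark on the classpath; the real code filters `outputSchema.fields` on `NumericType` as shown in the diff.

```scala
// Hypothetical stand-ins for Spark's schema types, so the sketch runs without Spark.
sealed trait ColType
case object NumericCol extends ColType
case object StringCol extends ColType
case object BooleanCol extends ColType
final case class Field(name: String, colType: ColType)

object RealFieldsSketch {
  // A numeric column counts as a real-valued field unless it is listed in catCols.
  def realFields(fields: Seq[Field], catCols: Set[String]): Seq[String] =
    fields.collect {
      case Field(name, NumericCol) if !catCols.contains(name) => name
    }

  def main(args: Array[String]): Unit = {
    val fields = Seq(
      Field("age", NumericCol),
      Field("zip", NumericCol),
      Field("city", StringCol),
      Field("active", BooleanCol)
    )
    // categoricalCols not set: every numeric column is a real field.
    println(realFields(fields, Set.empty))  // List(age, zip)
    // categoricalCols = ["zip"]: "zip" is excluded from the real fields.
    println(realFields(fields, Set("zip"))) // List(age)
  }
}
```

With an empty set the result matches the existing behavior; listing a numeric column such as `zip` only moves it out of `realFields`, so no field is dropped.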


