[
https://issues.apache.org/jira/browse/SPARK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen reassigned SPARK-32973:
------------------------------------
Assignee: zhengruifeng
> FeatureHasher does not check categoricalCols in inputCols
> ---------------------------------------------------------
>
> Key: SPARK-32973
> URL: https://issues.apache.org/jira/browse/SPARK-32973
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML
> Affects Versions: 2.3.0, 2.4.0, 3.0.0, 3.1.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Trivial
>
> The documentation for {{categoricalCols}} states:
> {code:java}
> Numeric columns to treat as categorical features. By default only string and
> boolean columns are treated as categorical, so this param can be used to
> explicitly specify the numerical columns to treat as categorical. Note, the
> relevant columns must also be set in inputCols. {code}
>
> However, the check that every column in {{categoricalCols}} is also in
> {{inputCols}} was never implemented.
> For example, in 2.4.7 and on current master (3.1.0):
> {code:java}
> scala> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.feature._
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> val df = Seq((2.0, 1, "foo"), (3.0, 2, "bar")).toDF("real", "int", "string")
> df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
> scala> val n = 100
> n: Int = 100
> scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n)
> hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
> scala> hasher.transform(df).show
> +----+---+------+--------------------+
> |real|int|string|            features|
> +----+---+------+--------------------+
> | 2.0|  1|   foo|(100,[2,39],[1.0,...|
> | 3.0|  2|   bar|(100,[2,42],[2.0,...|
> +----+---+------+--------------------+
> {code}
>
> Here the {{categoricalCols}} entry "real" is not in {{inputCols}} ("int", "string"),
> yet {{transform}} runs without raising an error.
>
> I think there are two options:
> 1. Remove the comment "Note, the relevant columns must also be set in
> inputCols.", since this requirement seems unnecessary.
> 2. Add a check to make sure all {{categoricalCols}} are in {{inputCols}}.
>
>
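For illustration, option 2 could look roughly like the sketch below. It is a standalone example, not the actual FeatureHasher code or the fix that was eventually merged. The object name `CategoricalColsCheck` and the helper `checkCategoricalCols` are hypothetical, and plain `Seq[String]` values stand in for the two params so the snippet runs without Spark.

```scala
object CategoricalColsCheck {

  /** Hypothetical helper: fails if any categorical column is missing from inputCols. */
  def checkCategoricalCols(inputCols: Seq[String], categoricalCols: Seq[String]): Unit = {
    // Collect the categorical columns that were not declared as inputs.
    val missing = categoricalCols.filterNot(inputCols.toSet)
    // require throws IllegalArgumentException when the condition is false.
    require(
      missing.isEmpty,
      s"categoricalCols ${missing.mkString("[", ", ", "]")} must also be set in " +
        s"inputCols ${inputCols.mkString("[", ", ", "]")}")
  }

  def main(args: Array[String]): Unit = {
    // "real" is listed in both places, so the check passes silently.
    checkCategoricalCols(Seq("real", "int", "string"), Seq("real"))

    // Mirrors the report above: "real" is categorical but not an input column.
    val result = scala.util.Try(checkCategoricalCols(Seq("int", "string"), Seq("real")))
    println(result.isFailure) // prints: true
  }
}
```

In a real transformer, a check like this would presumably run during schema validation, so that the configuration shown in the report fails fast instead of silently ignoring "real".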