[jira] [Updated] (SPARK-32973) FeatureHasher does not check categoricalCols in inputCols

zhengruifeng (Jira) Wed, 23 Sep 2020 03:25:12 -0700


     [ https://issues.apache.org/jira/browse/SPARK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


zhengruifeng updated SPARK-32973:
---------------------------------
    Description: 
The documentation for {{categoricalCols}} says:
{code:java}
Numeric columns to treat as categorical features. By default only string and 
boolean columns are treated as categorical, so this param can be used to 
explicitly specify the numerical columns to treat as categorical. Note, the 
relevant columns must also be set in inputCols.
{code}
 

However, the check that every column in {{categoricalCols}} is also in 
{{inputCols}} was never implemented.

For example, in 2.4.7 and current master (3.1.0):
{code:java}
scala> import org.apache.spark.ml.feature._
import org.apache.spark.ml.feature._
scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.{Vector, Vectors}

scala> val df = Seq((2.0, 1, "foo"), (3.0, 2, "bar")).toDF("real", "int", "string")
df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
scala> val n = 100
n: Int = 100
scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n)
hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
scala> hasher.transform(df).show
+----+---+------+--------------------+
|real|int|string|            features|
+----+---+------+--------------------+
| 2.0|  1|   foo|(100,[2,39],[1.0,...|
| 3.0|  2|   bar|(100,[2,42],[2.0,...|
+----+---+------+--------------------+

{code}
 

The {{categoricalCols}} entry "real" is not in {{inputCols}} ("int", "string"), yet {{transform}} runs without any error.

 

I think there are two options:

1. Remove the comment "Note, the relevant columns must also be set in 
inputCols.", since this requirement seems unnecessary.

2. Add a check to make sure all {{categoricalCols}} are in {{inputCols}}.
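
For option 2, here is a minimal sketch of what such a check might look like. It is plain Scala with no Spark dependency. The names {{CategoricalColsCheck}}, {{missingFromInput}} and {{validate}} are hypothetical and only for illustration, not existing Spark APIs, and where the check would be called from inside {{FeatureHasher}} is left open:

{code:scala}
// Hypothetical helper, not part of Spark: sketches option 2 as a plain
// precondition on the two lists of column names.
object CategoricalColsCheck {

  /** Returns the categorical column names that do not appear in inputCols. */
  def missingFromInput(inputCols: Seq[String], categoricalCols: Seq[String]): Seq[String] = {
    val inputSet = inputCols.toSet
    categoricalCols.filterNot(inputSet.contains).distinct
  }

  /** Throws IllegalArgumentException if any categorical column is not an input column. */
  def validate(inputCols: Seq[String], categoricalCols: Seq[String]): Unit = {
    val missing = missingFromInput(inputCols, categoricalCols)
    require(missing.isEmpty,
      "categoricalCols must be a subset of inputCols, but these are missing: " +
        missing.mkString(", "))
  }
}
{code}

With the columns from the example above, {{CategoricalColsCheck.validate(Seq("int", "string"), Seq("real"))}} would throw an {{IllegalArgumentException}} instead of letting the transform proceed.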

 

 

  was:
doc related to {{categoricalCols}}:
{code:java}
Numeric columns to treat as categorical features. By default only string and 
boolean columns are treated as categorical, so this param can be used to 
explicitly specify the numerical columns to treat as categorical. Note, the 
relevant columns must also be set in inputCols. {code}
 

However, the check to make sure {{categoricalCols}} in {{inputCols}} was never 
implemented:

for example, in 2.4.7 and current master(3.1.0):
{code:java}
scala> import org.apache.spark.ml.feature._
import org.apache.spark.ml.feature._
scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.{Vector, Vectors}

scala> val df = Seq((2.0, 1, "foo"), (3.0, 2, "bar")).toDF("real", "int", "string")
df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
scala> val n = 100
n: Int = 100
scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n)
hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
scala> hasher.transform(df).show
+----+---+------+--------------------+
|real|int|string|            features|
+----+---+------+--------------------+
| 2.0|  1|   foo|(100,[2,39],[1.0,...|
| 3.0|  2|   bar|(100,[2,42],[2.0,...|
+----+---+------+--------------------+

{code}
 

CategoricalCols "real" is not in inputCols ("int", "string").

 

I think there are two options:

1, remove this comment  "Note, the relevant columns must also be set in 
inputCols. ", since this check seems un

 

 


> FeatureHasher does not check categoricalCols in inputCols
> ---------------------------------------------------------
>
>                 Key: SPARK-32973
>                 URL: https://issues.apache.org/jira/browse/SPARK-32973
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0, 2.4.0, 3.0.0, 3.1.0
>            Reporter: zhengruifeng
>            Priority: Trivial



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

