[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611628#comment-16611628
 ] 

Vincent commented on SPARK-25412:
---------------------------------

[~nick.pentre...@gmail.com] thanks.

> FeatureHasher would change the value of output feature
> ------------------------------------------------------
>
>                 Key: SPARK-25412
>                 URL: https://issues.apache.org/jira/browse/SPARK-25412
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Vincent
>            Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo of 
> the hashed value determines the vector index, and users are advised to set 
> the 'numFeatures' parameter to a large integer.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when reading feature importances from a decision tree 
> trained on FeatureHasher output.
>  # When indices collide, which is likely when 'numFeatures' is relatively 
> small, the value at that index is replaced with a new value (the sum of the 
> current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but the 
> resulting highly sparse vectors increase the computational cost of model training.
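
The collision behaviour described above can be reproduced without Spark. The following is only a minimal sketch of the hashing trick, not the FeatureHasher source: `hash_features` is a hypothetical helper, the feature names are made up, and `zlib.crc32` is a stand-in for the MurmurHash3 function that Spark uses.

```python
import zlib
from collections import defaultdict


def hash_features(row, num_features):
    """Toy hashing trick: index = hash(name) mod num_features.

    Values that land on the same index are summed, mirroring the
    behaviour described in the issue. zlib.crc32 is only a stand-in
    hash (an assumption of this sketch), not what Spark uses.
    """
    vec = defaultdict(float)      # index -> accumulated value
    sources = defaultdict(list)   # index -> feature names hashed to it
    for name, value in row.items():
        idx = zlib.crc32(name.encode("utf-8")) % num_features
        vec[idx] += value
        sources[idx].append(name)
    return dict(vec), dict(sources)


# Hypothetical input row; the names and values are illustrative only.
row = {"age": 1.0, "income": 2.0, "clicks": 3.0}

# With a single bucket every feature collides, so the output holds
# only the sum 6.0 at index 0 and the individual values are lost.
vec, sources = hash_features(row, num_features=1)
print(vec)
print(sources)
```

The `sources` map exists only inside this sketch. The hashed vector itself carries no such reverse mapping, which is why a feature name cannot be looked up from an index afterwards, and why more than one name may correspond to the same index.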



