[
https://issues.apache.org/jira/browse/SPARK-25365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-25365.
----------------------------------
Resolution: Invalid
> a better way to handle vector index and sparsity in FeatureHasher
> implementation ?
> ----------------------------------------------------------------------------------
>
> Key: SPARK-25365
> URL: https://issues.apache.org/jira/browse/SPARK-25365
> Project: Spark
> Issue Type: Question
> Components: ML
> Affects Versions: 2.3.1
> Reporter: Vincent
> Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo of
> the hashed value is used to determine the vector index, and it is suggested
> to use a large integer value for the numFeatures parameter.
> We found several issues with the current implementation:
> # The feature name cannot be recovered from its index after the FeatureHasher
> transform, for example when retrieving feature importances from a decision
> tree trained on hashed features.
> # When indices collide, which is very likely to happen especially when
> 'numFeatures' is relatively small, the value at that index is updated with
> the sum of the current and old values, i.e., the values of the colliding
> features are changed by this module.
> # To avoid collisions, 'numFeatures' must be set to a large number, but the
> resulting highly sparse vectors increase the computational cost of model
> training.
> We are working on fixing these problems for our business needs. Since this
> may or may not be an issue for others as well, we would like to hear from
> the community.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]