[
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Pentreath updated SPARK-13968:
-----------------------------------
Summary: Use MurmurHash3 for hashing String features (was: Use
MurmurHash for feature hashing)
> Use MurmurHash3 for hashing String features
> -------------------------------------------
>
> Key: SPARK-13968
> URL: https://issues.apache.org/jira/browse/SPARK-13968
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Nick Pentreath
> Priority: Minor
>
> Feature hashing is typically applied to strings, i.e. feature names. For raw
> feature indexes, either the string representation of the numerical index can
> be hashed, or the index can be used "as-is" without hashing.
> It is common to use a well-distributed hash function such as MurmurHash3.
> This is the case in e.g.
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. We should look
> at using MurmurHash3 instead (at least for {{String}}, which is the common
> case).
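For illustration, hashing string feature names with MurmurHash3 could look roughly like the sketch below. It is a pure-Python rendering of the public 32-bit MurmurHash3 (x86) algorithm, not Spark's or Scikit-learn's code. The names `murmur3_32`, `feature_index` and `term_frequencies`, the UTF-8 encoding, the default seed of 0 and the modulo mapping onto `num_features` buckets are all assumptions made for this example.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k(k: int) -> int:
    """Scramble one 32-bit block before it is folded into the hash state."""
    k = (k * 0xCC9E2D51) & MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant) of data, as an unsigned integer."""
    h = seed & MASK32
    n = len(data)
    body_end = n - (n % 4)

    # Body: consume the input four bytes at a time, little-endian.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_end:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def feature_index(name: str, num_features: int, seed: int = 0) -> int:
    """Map a feature name to a bucket in [0, num_features) (hypothetical helper)."""
    return murmur3_32(name.encode("utf-8"), seed) % num_features


def term_frequencies(terms, num_features: int) -> dict:
    """Count terms per hashed bucket; colliding terms share a bucket."""
    return dict(Counter(feature_index(t, num_features) for t in terms))


if __name__ == "__main__":
    print(term_frequencies(["spark", "ml", "spark", "hashing"], 1 << 18))
```

Because the hash is computed from the UTF-8 bytes of the string rather than from a platform-specific object hash code, the same feature name maps to the same index regardless of language or runtime, as long as the seed and bucket count match.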