Nick Pentreath created SPARK-13968:
--------------------------------------
Summary: Use MurmurHash3 for feature hashing
Key: SPARK-13968
URL: https://issues.apache.org/jira/browse/SPARK-13968
Project: Spark
Issue Type: Sub-task
Components: ML, MLlib
Reporter: Nick Pentreath
Priority: Minor
Feature hashing is typically applied to strings, i.e. feature names. For raw
feature indexes, either the string representation of the numerical index is
hashed, or the index is used "as-is" without hashing.
It is common to use a well-distributed hash function such as MurmurHash3, as
is done in e.g.
[Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
Currently, Spark's {{HashingTF}} uses the object's hash code. We should look at
using MurmurHash3 instead (at least for {{String}}, which is the common case).
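To make the proposal concrete, here is a minimal sketch of hashing string terms to feature indices with 32-bit MurmurHash3. It is an illustration only, not Spark's implementation: the function names {{murmur3_32}} and {{hash_term_frequencies}}, the seed of 0, the UTF-8 encoding, and the default of 2^20 features are all assumptions made for the example.

```python
from typing import Dict, Iterable


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    nblocks = n // 4

    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def hash_term_frequencies(terms: Iterable[str],
                          num_features: int = 1 << 20) -> Dict[int, float]:
    """Map each term to an index via MurmurHash3 and count occurrences."""
    vec: Dict[int, float] = {}
    for term in terms:
        # Hash the UTF-8 bytes of the term, then reduce to the index range.
        idx = murmur3_32(term.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


if __name__ == "__main__":
    print(hash_term_frequencies(["spark", "ml", "spark"], num_features=16))
```

Hashing the encoded bytes rather than relying on a language-level object hash code keeps the term-to-index mapping independent of the runtime, which is one reason a fixed function like MurmurHash3 is attractive here.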