Nick Pentreath created SPARK-13968:
--------------------------------------

             Summary: Use MurmurHash for feature hashing
                 Key: SPARK-13968
                 URL: https://issues.apache.org/jira/browse/SPARK-13968
             Project: Spark
          Issue Type: Sub-task
          Components: ML, MLlib
            Reporter: Nick Pentreath
            Priority: Minor


Feature hashing is typically applied to strings, i.e. feature names. For raw 
feature indexes, either the string representation of the numerical index can 
be hashed, or the index can be used "as-is" without hashing.

It is common to use a well-distributed hash function such as MurmurHash3. This 
is the case in e.g. 
[Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].

Currently Spark's {{HashingTF}} uses the object's hash code. We should look at 
using MurmurHash3 instead (at least for {{String}}, which is the common case).
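
To illustrate the idea, here is a minimal sketch in Python of hashing string features to indexes with MurmurHash3 (the x86 32-bit variant). It is not Spark's or Scikit-learn's implementation; the names {{murmur3_32}} and {{hashing_tf}}, the default seed of 0, and the default of 1024 features are assumptions made for this example only.

```python
from collections import Counter


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit; returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    n_blocks = len(data) // 4

    def rotl(x: int, r: int) -> int:
        # 32-bit left rotation.
        return ((x << r) | (x >> (32 - r))) & mask

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: mix the remaining 0-3 bytes.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def hashing_tf(terms, num_features: int = 1024) -> dict:
    """Map each string term to a bucket index and count occurrences.

    num_features is a hypothetical default chosen for this sketch.
    """
    counts = Counter()
    for term in terms:
        index = murmur3_32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)
```

Calling {{hashing_tf(["spark", "ml", "spark"])}} returns a dict from bucket index to term count. Because the hash is computed from the UTF-8 bytes of the string, the resulting indexes do not depend on a language runtime's built-in object hash code.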



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)