[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201401#comment-15201401
 ] 

Yanbo Liang commented on SPARK-13968:
-------------------------------------

Sure, I will do performance comparison firstly.

> Use MurmurHash3 for hashing String features
> -------------------------------------------
>
>                 Key: SPARK-13968
>                 URL: https://issues.apache.org/jira/browse/SPARK-13968
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Nick Pentreath
>            Assignee: Yanbo Liang
>            Priority: Minor
>
> Typically feature hashing is done on strings, i.e. feature names (or in the 
> case of raw feature indexes, either the string representation of the 
> numerical index can be used, or the index used "as-is" and not hashed).
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Look at using 
> MurmurHash3 (at least for {{String}} which is the common case).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to