[GitHub] spark pull request: [SPARK-14899] [ML] [PySpark] Remove spark.ml H...

MLnick Tue, 26 Apr 2016 08:30:06 -0700

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12702#discussion_r61107574
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -31,12 +31,9 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
     /**
      * :: Experimental ::
      * Maps a sequence of terms to their term frequencies using the hashing trick.
    - * Currently we support two hash algorithms: "murmur3" (default) and "native".
    + * Currently we support one hash algorithms "murmur3" which is also the default option.
    --- End diff --
    
    (Somewhat related to this PR) By the way, it is common to choose a vector size that is a power of 2, e.g. the [scikit-learn docs mention this](http://scikit-learn.org/stable/modules/feature_extraction.html#implementation-details):
    > Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
    
    We should add that recommendation to the HashingTF doc.
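
To illustrate the point in the scikit-learn quote, here is a minimal Python sketch of why a modulo over a power-of-two hash space spreads unevenly when the vector size is not itself a power of two. It is a toy model, not Spark's or scikit-learn's implementation: the hash space is assumed to be only 8 bits wide so the imbalance is easy to see, and the helper names `bucket_counts` and `spread` are hypothetical. With a real 32-bit hash the same effect exists but is far smaller in relative terms.

```python
from collections import Counter


def bucket_counts(hash_bits: int, num_features: int) -> Counter:
    """Count how many distinct hash values map to each column index.

    Assumes hash values are the integers 0 .. 2**hash_bits - 1 and that the
    column index is simply `hash_value % num_features`.
    """
    return Counter(h % num_features for h in range(2 ** hash_bits))


def spread(hash_bits: int, num_features: int) -> tuple:
    """Return (fewest, most) hash values assigned to any single column."""
    counts = bucket_counts(hash_bits, num_features)
    return min(counts.values()), max(counts.values())


# Power-of-two vector size: every column receives the same share.
print(spread(8, 128))  # (2, 2)

# Non-power-of-two vector size: columns 0..55 receive one extra hash value.
print(spread(8, 100))  # (2, 3)
```

In the second case, 256 is not divisible by 100, so the remainder of the hash space wraps around onto the lowest column indices and those columns see slightly more collisions.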



