Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/12702#discussion_r61107574
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -31,12 +31,9 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
/**
* :: Experimental ::
* Maps a sequence of terms to their term frequencies using the hashing trick.
- * Currently we support two hash algorithms: "murmur3" (default) and "native".
+ * Currently we support one hash algorithms "murmur3" which is also the default option.
--- End diff --
(Somewhat related to this PR) By the way, it is common to choose a vector
size that is a power of 2. For example, the [scikit-learn doc mentions
this](http://scikit-learn.org/stable/modules/feature_extraction.html#implementation-details):
> Since a simple modulo is used to transform the hash function to a column
index, it is advisable to use a power of two as the n_features parameter;
otherwise the features will not be mapped evenly to the columns.
We should add that to the doc.
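To illustrate the point, here is a minimal sketch, not Spark's or scikit-learn's actual code. It assumes a toy 8-bit hash space (values 0 to 255) standing in for the real 32-bit one, and the names `ModuloBuckets` and `bucketCounts` are hypothetical. It counts how many distinct hash values a plain modulo sends to each column:

```scala
// Toy illustration only: an 8-bit hash space (0..255) stands in for the
// 32-bit hash space; this is not the HashingTF implementation.
object ModuloBuckets {
  // Count how many hash values in [0, hashSpace) land in each column
  // when the column index is computed as hash % numFeatures.
  def bucketCounts(hashSpace: Int, numFeatures: Int): Array[Int] = {
    val counts = new Array[Int](numFeatures)
    (0 until hashSpace).foreach(h => counts(h % numFeatures) += 1)
    counts
  }

  def main(args: Array[String]): Unit = {
    val pow2 = bucketCounts(256, 64)   // 64 divides 256 evenly
    val other = bucketCounts(256, 100) // 100 does not divide 256
    println(s"numFeatures=64:  min=${pow2.min}, max=${pow2.max}")   // min=4, max=4
    println(s"numFeatures=100: min=${other.min}, max=${other.max}") // min=2, max=3
  }
}
```

With 64 columns every column receives exactly 4 hash values. With 100 columns, columns 0 to 55 receive 3 and the rest receive 2, so the low columns are slightly over-represented. The same effect holds, at a smaller relative scale, for a 32-bit hash space.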