Github user hhbyyh commented on the pull request:
https://github.com/apache/spark/pull/7388#issuecomment-131104897
Thanks @jkbradley for the update. Copy the comment of Joseph from
https://github.com/hhbyyh/spark/pull/3 to here:
Updates:
Renamed "minCount" to "minTokenCount"
Added "minTermFreq" back, including unit test
Moved all Params to include in both Estimator and Model so that they can be
viewed in either.
New thoughts:
Previously, we had decided to call "minTermFreq" a "frequency" even though
it is a count. That was to follow other things such as sklearn and "TF-IDF."
Now, I would like to rename it to "minTermCount." Yuhao got me thinking
along these lines, and it actually matches other libraries to some extent:
sklearn calls it "min_tf," but it accepts both integers (counts) and
doubles (fractions or frequencies). Since we cannot take both the way Params
are set up, the best we could do is have a minTermCount Param and a minTermFreq
Param. Frequency might actually be a better way to specify this threshold.
TF-IDF arguably uses fractional frequencies (not counts)...depending on
whether "TF-IDF" refers to the value or the normalizer.
Proposal: I propose we do 1 of the following:
(A) Rename "minTermFreq" to "minTermCount"
(B) Change "minTermFreq" to take a Double fraction
(C) Remove "minTermFreq" for now pending further discussion
CC: @hhbyyh @mengxr What do you think?
@hhbyyh Feel free to merge, take parts, or modify as you see fit. It may be
easiest if you go ahead and merge/modify so that @mengxr can view your full PR.
Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]