GitHub user rnowling opened a pull request:
https://github.com/apache/spark/pull/2494
[SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614)
adds functionality for filtering out terms which do not appear in at least a
minimum number of documents.
This is implemented using a minimumOccurence parameter (default 0). When
terms' document frequencies are less than minimumOccurence, their IDFs are set
to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are
found to be 0, as if the terms were not present in the documents.
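The filtering rule described above can be sketched in plain Python, independent of Spark. This is an illustration only, not the code in this PR: the function names `idf_vector` and `tf_idf` are hypothetical, and the smoothed formula `log((m + 1) / (df + 1))` is an assumption about how the IDF is computed.

```python
import math


def idf_vector(doc_freqs, num_docs, minimum_occurence=0):
    """Return one IDF value per term; filtered terms get 0.0.

    Hypothetical helper: a term is filtered when its document frequency
    is 0 or falls below minimum_occurence.
    """
    idfs = []
    for df in doc_freqs:
        if df == 0 or df < minimum_occurence:
            idfs.append(0.0)
        else:
            # Assumed smoothed IDF formula: log((m + 1) / (df + 1)).
            idfs.append(math.log((num_docs + 1.0) / (df + 1.0)))
    return idfs


def tf_idf(term_freqs, idfs):
    """Scale each term frequency by its IDF."""
    return [tf * idf for tf, idf in zip(term_freqs, idfs)]


# Three terms over a corpus of 4 documents, with document frequencies 3, 1, 0.
idfs = idf_vector([3, 1, 0], num_docs=4, minimum_occurence=2)
# The second term occurs 5 times in this document but in only 1 document
# overall, so its TF-IDF weight is zeroed out by the filter.
weights = tf_idf([2.0, 5.0, 0.0], idfs)
```

With the default `minimum_occurence=0`, no term with a nonzero document frequency is filtered, so the result matches the unfiltered computation.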
This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and
DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default
minimumOccurence value of 0 to remain backwards-compatible with the original
IDF API.
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
* Update the MLlib Feature Extraction programming guide to describe the
new feature
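To illustrate how a default threshold of 0 keeps existing callers unaffected, here is a minimal Python sketch of an aggregator that counts document frequencies and applies the threshold. The class below borrows the name `DocumentFrequencyAggregator` for readability, but its methods (`add`, `idf`), its dictionary-based storage, and the smoothed formula are assumptions made for this example; it is not the Scala class changed in this PR.

```python
import math


class DocumentFrequencyAggregator:
    """Hypothetical stand-in that counts in how many documents each term appears."""

    def __init__(self, minimum_occurence=0):
        # Default of 0 means no filtering, matching the pre-existing behavior.
        self.minimum_occurence = minimum_occurence
        self.num_docs = 0
        self.doc_freqs = {}

    def add(self, doc_terms):
        """Record one document; each distinct term counts once."""
        self.num_docs += 1
        for term in set(doc_terms):
            self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1
        return self

    def idf(self):
        """Return a term -> IDF mapping, with filtered terms set to 0.0."""
        result = {}
        for term, df in self.doc_freqs.items():
            if df < self.minimum_occurence:
                result[term] = 0.0
            else:
                # Assumed smoothed IDF formula: log((m + 1) / (df + 1)).
                result[term] = math.log((self.num_docs + 1.0) / (df + 1.0))
        return result


docs = [["spark", "idf"], ["spark", "mllib"], ["idf"]]

unfiltered = DocumentFrequencyAggregator()  # no argument: threshold 0
filtered = DocumentFrequencyAggregator(minimum_occurence=2)
for doc in docs:
    unfiltered.add(doc)
    filtered.add(doc)

unfiltered_idf = unfiltered.idf()
filtered_idf = filtered.idf()
```

Here "mllib" appears in only one document, so it keeps a nonzero IDF in `unfiltered_idf` and is set to 0 in `filtered_idf`, while "spark" and "idf" are identical in both.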
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rnowling/spark spark-3614-idf-filter
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2494.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2494
----
commit c0cc64380e906f08a0f8abbfd5c2ccd3c0333bd5
Author: RJ Nowling <[email protected]>
Date: 2014-09-22T20:53:56Z
Add minimumOccurence filtering to IDF
----