GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/6742
[Spark-8169] [ML] Add StopWordsRemover as a transformer
jira: https://issues.apache.org/jira/browse/SPARK-8169
stop words: http://en.wikipedia.org/wiki/Stop_words
StopWordsRemover takes a string array column and outputs a string array
column with all defined stop words removed. The transformer should also come
with a standard set of stop words as default.
Currently I use a minimal stop words set, since in some
[cases](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)
a small set of stop words is preferred.
ASCII chars have been tested, but I cannot check that test in due to the style check.
Further thoughts:
1. Maybe I should use OpenHashSet. Is it recommended?
2. Currently I leave nulls in the input array untouched, i.e. Array(null,
null) => Array(null, null).
3. If the current stop words set looks too limited, any suggestions for a
replacement? We could use something similar to the one in
[SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).
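
To make the behavior described above concrete, here is a minimal sketch in plain Python rather than the Scala code in this pull request. It assumes exact, case-sensitive matching, which the description does not specify. The function name `remove_stop_words` and the tiny `DEFAULT_STOP_WORDS` set are hypothetical and are not the defaults shipped in the patch.

```python
from typing import Iterable, List, Optional

# Hypothetical, deliberately small default set (illustration only).
DEFAULT_STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def remove_stop_words(
    tokens: Iterable[Optional[str]],
    stop_words: Iterable[str] = DEFAULT_STOP_WORDS,
) -> List[Optional[str]]:
    """Return the tokens with stop words dropped; None entries pass through."""
    # A hash-based set gives constant-time membership checks per token.
    stop = frozenset(stop_words)
    result: List[Optional[str]] = []
    for token in tokens:
        # Nulls are left untouched, matching point 2 above.
        if token is None or token not in stop:
            result.append(token)
    return result


if __name__ == "__main__":
    print(remove_stop_words(["the", "quick", "fox", "is", "fast"]))
    print(remove_stop_words([None, None]))
```

Building the set once and reusing it for every token is the same idea behind question 1: whichever set implementation is chosen, the lookup cost per token should stay constant.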
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark stopwords
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6742.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6742
----
commit b3aa957a2abf92fdc5b0389d79bfec9389dcbaf8
Author: Yuhao Yang <[email protected]>
Date: 2015-06-10T10:50:05Z
add stopWordsRemover
----