GitHub user mengxr opened a pull request:
https://github.com/apache/spark/pull/12843
[SPARK-14050] [ML] Add multiple languages support and additional methods
for Stop Words Remover
## What changes were proposed in this pull request?
This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* convert stopwords to a list in Python
* update some tests and docs
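
As a rough illustration of the first two bullets (this is not the code in this PR), the intended behaviour might be sketched in plain Python as below. The names `DEFAULT_STOP_WORDS`, `load_default_stop_words`, and `remove_stop_words`, and the tiny word lists, are hypothetical and chosen only for this sketch; the actual lists are the resource files added in the commits listed further down.

```python
# Hypothetical, heavily abbreviated per-language lists (assumption for this
# sketch only; the real lists are resource files shipped with the patch).
DEFAULT_STOP_WORDS = {
    "english": ("a", "an", "the", "is", "d"),
    "turkish": ("ve", "bir", "bu"),
}


def load_default_stop_words(language="english"):
    """Return the stop words for `language` as a plain Python list."""
    try:
        return list(DEFAULT_STOP_WORDS[language])
    except KeyError:
        raise ValueError("unsupported language: %s" % language)


def remove_stop_words(tokens, stop_words=None, case_sensitive=False):
    """Drop stop words from `tokens`; English is the default list."""
    if stop_words is None:
        stop_words = load_default_stop_words()
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive comparison: lowercase both sides before matching.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]
```

Calling `remove_stop_words(tokens)` with no explicit list falls back to the English defaults, and `load_default_stop_words(...)` always hands back a `list` rather than a tuple or other wrapper, which is the Python-side change the second bullet describes.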
## How was this patch tested?
Unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark SPARK-14050
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12843.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12843
----
commit c126c87818eb06aa5c2ac23b362d504f342c72b0
Author: Burak Köse <[email protected]>
Date: 2016-03-14T22:22:02Z
add language files
commit 8248579ec27a40de98fe1f3020d947c478981ebc
Author: Burak Köse <[email protected]>
Date: 2016-03-14T22:23:32Z
add multi-language support for stop words
commit 2c7b73df14d2d292eff88d7f3c358d29f82f6122
Author: Burak Köse <[email protected]>
Date: 2016-03-14T22:24:41Z
add new tests for StopWordsRemover
commit 43e5cf54d4f9583f8b90291b3c7603ac4e7fab2a
Author: Burak Köse <[email protected]>
Date: 2016-03-21T23:41:47Z
adjust resource files
commit a43039223a28b308ae1c14d33be5e5a1df382ed6
Author: Burak Köse <[email protected]>
Date: 2016-03-21T23:43:15Z
adjust resource files
commit 28ee249f676971371d11d16c2912bbf81e045269
Author: Burak Köse <[email protected]>
Date: 2016-03-21T23:46:42Z
fix stopwords bug
commit 6d215b31a205c4a79e8cc0ef6963d239941e80ff
Author: Burak Köse <[email protected]>
Date: 2016-03-21T23:53:06Z
update comment lines
commit 6deceecf88c66b3293698aca5d7306c2aa02e2e0
Author: Burak Köse <[email protected]>
Date: 2016-03-22T16:24:38Z
update stop words list
commit 41cd25815af3baa8fe9ed9336812f436d7ed7bd5
Author: Burak Köse <[email protected]>
Date: 2016-03-22T16:25:36Z
update stopwordsremover
commit 4d1812aae64b0b15312940b1a6c42e19f9686480
Author: Burak KOSE <[email protected]>
Date: 2016-03-22T17:35:37Z
fix test case bug
After updating English stop words list, "d" is a stop word.
commit a30862231c3944c55c96cc94e162f61614aee6d5
Author: Burak Köse <[email protected]>
Date: 2016-03-22T21:45:48Z
fix encoding
commit 2e7c54e5c17e7c5672a43ffc28acb207e94bf28a
Author: Burak Köse <[email protected]>
Date: 2016-03-23T01:42:36Z
fix pyspark test
commit 7efda40e39663deef0b0884a7bfca13b5d10d706
Author: Burak Köse <[email protected]>
Date: 2016-03-23T16:51:48Z
add licence for stop words list
commit a066e8b34ec4824fa26a1e306e197b66400f5ccb
Author: Burak Köse <[email protected]>
Date: 2016-03-24T17:12:20Z
change licence to license
commit d0f43ace892332dfb3ad25d0ef1d0c0451540e5c
Author: Burak Köse <[email protected]>
Date: 2016-03-25T16:23:37Z
add readme for stopwords list
commit c017ee235287554e28281d1691d0188e358b7ad8
Author: Burak Köse <[email protected]>
Date: 2016-03-25T16:26:23Z
merge StopWords into StopWordsRemover
commit 55191ce1f449bed55884a4481071b0fc5ee776a9
Author: Burak Köse <[email protected]>
Date: 2016-03-25T16:27:59Z
add python stopwords support for language selection
commit 789342f2d26759db180868a9f59b02c8f85cc835
Author: Burak Köse <[email protected]>
Date: 2016-03-25T16:28:48Z
add new tests for stopwords
commit 4f97c8d5a088595a23f7ec848c793d05fc052d79
Author: Xiangrui Meng <[email protected]>
Date: 2016-05-02T15:26:29Z
Merge remote-tracking branch 'apache/master' into SPARK-14050
commit 713d4d5e81b2194efa640ec46fa16c56049c00f5
Author: Xiangrui Meng <[email protected]>
Date: 2016-05-02T15:51:31Z
minor updates
commit 1bd69af46f43d25518f6c5e01e2ee7fc5c279a03
Author: Xiangrui Meng <[email protected]>
Date: 2016-05-02T16:05:52Z
fix python tests and add a TODO
----