GitHub user yinxusen opened a pull request:
https://github.com/apache/spark/pull/5596
[ML][SPARK-6529] Add Word2Vec transformer
See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).
A few notes:
1. I added `learningRate` to sharedParams, since it is a common parameter for
ML algorithms.
2. This PR does not support a transform that finds synonyms from a `Vector`;
that will be supported in future JIRA issues.
3. Word2Vec differs from other ML models in that its training set and its
transformed set are different. The training set is an `RDD[Iterable[String]]`
that represents documents, but the set we want to transform is an `RDD[String]`
that represents unique words. So you have to switch your `inputCol` between
these two stages.
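
To illustrate note 3, here is a minimal sketch of the two input shapes. It uses plain Scala collections rather than RDDs so that it runs without Spark, and it is not code from this PR: the object name `Word2VecInputs`, the helper `vocabulary`, and the sample documents are all hypothetical.

```scala
// Hypothetical illustration only; plain collections stand in for RDDs.
object Word2VecInputs {
  // Shape of the training input: one Iterable[String] per document.
  val documents: Seq[Iterable[String]] = Seq(
    Seq("spark", "ml", "pipeline"),
    Seq("spark", "word2vec", "transformer")
  )

  // Shape of the transform input: the unique words across all documents.
  // Sorting is only there to make the result deterministic.
  def vocabulary(docs: Seq[Iterable[String]]): Seq[String] =
    docs.flatten.distinct.sorted

  def main(args: Array[String]): Unit =
    println(vocabulary(documents).mkString(", "))
}
```

With an actual `RDD[Iterable[String]]`, the analogous step would presumably be a `flatMap` followed by `distinct()`, with the resulting word column then used as the `inputCol` for the transform stage.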
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yinxusen/spark SPARK-6529
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5596.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5596
----
commit 6a514f16fd12f7b7dbf9fe33e442b33958d1cd20
Author: Xusen Yin <[email protected]>
Date: 2015-04-19T09:22:33Z
add word2vec transformer
commit 02767fb59d2b3583000a15d3a337b6a41c6be71f
Author: Xusen Yin <[email protected]>
Date: 2015-04-20T04:43:48Z
add shared params
commit fe3afe99214f72517a3a695063dc710110f8dd31
Author: Xusen Yin <[email protected]>
Date: 2015-04-20T06:53:29Z
add test suite and pass it
commit e29680a091806bcb3ee6c9b8a44e407b4bd040fa
Author: Xusen Yin <[email protected]>
Date: 2015-04-20T15:34:09Z
fix errors
commit 618abd0cc3727896448c227ccccac351a0e592a6
Author: Xusen Yin <[email protected]>
Date: 2015-04-20T15:57:37Z
refine comments
----