GitHub user shubhamchopra commented on the issue:
https://github.com/apache/spark/pull/17673
The [original paper](https://arxiv.org/abs/1301.3781) proposed two model
architectures for generating word embeddings: the Continuous Skip-Gram model
and the Continuous Bag-of-Words (CBOW) model. Spark ML currently implements
only the Skip-Gram model; this PR adds CBOW. The two architectures are
competing alternatives, so this implementation gives users the option to pick
whichever suits their data best.
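For concreteness, here is a toy sketch of the training pairs each architecture generates for a given window; the object/method names and the example sentence are illustrative only, not code from the PR:

```scala
// Toy sketch of the training pairs each architecture generates; the object and
// method names here are illustrative, not part of the PR.
object ArchitectureSketch {
  def trainingPairs(sentence: Seq[String], window: Int): Unit = {
    sentence.indices.foreach { i =>
      val context = ((i - window) to (i + window))
        .filter(j => j >= 0 && j < sentence.length && j != i)
        .map(sentence)
      // Skip-Gram: the center word predicts each context word separately.
      context.foreach(c => println(s"skip-gram: ${sentence(i)} -> $c"))
      // CBOW: the context words (averaged into one vector during training)
      // jointly predict the center word.
      println(s"cbow: [${context.mkString(", ")}] -> ${sentence(i)}")
    }
  }

  def main(args: Array[String]): Unit =
    trainingPairs(Seq("the", "quick", "brown", "fox"), window = 1)
}
```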
The implementation is largely based on the [original C
implementation](https://code.google.com/archive/p/word2vec/). I used negative
sampling, since it was shown to perform well in [the follow-up
paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
and I vectorized operations with BLAS where possible.
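Roughly, a single CBOW negative-sampling step looks like the sketch below. Everything named here (`hidden`, `syn1neg`, `update`, etc.) is an assumption for illustration rather than the PR's actual code, and the hand-rolled `dot`/`axpy` loops stand in for the BLAS calls (`sdot`/`saxpy`) used for vectorization:

```scala
// Rough sketch of one CBOW negative-sampling step. Assumptions for
// illustration: `hidden` is the context vector averaged from the input
// embeddings, `syn1neg` holds the output vectors in one flat array (as in the
// original C code), and `grad` accumulates the gradient that would later be
// spread back to each context word's input vector.
object NegativeSamplingSketch {
  def dot(a: Array[Float], ao: Int, b: Array[Float], bo: Int, n: Int): Float = {
    var s = 0f; var i = 0
    while (i < n) { s += a(ao + i) * b(bo + i); i += 1 }
    s
  }

  def axpy(alpha: Float, x: Array[Float], xo: Int,
           y: Array[Float], yo: Int, n: Int): Unit = {
    var i = 0
    while (i < n) { y(yo + i) += alpha * x(xo + i); i += 1 }
  }

  /** `target` is the center word (label 1); `negatives` are sampled word ids
    * treated as label 0. `dim` is the vector size, `lr` the learning rate. */
  def update(hidden: Array[Float], grad: Array[Float], syn1neg: Array[Float],
             target: Int, negatives: Seq[Int], dim: Int, lr: Float): Unit = {
    (target +: negatives).zipWithIndex.foreach { case (w, k) =>
      val label = if (k == 0) 1f else 0f
      // Sigmoid of the score between the hidden vector and word w's output vector.
      val f = 1f / (1f + math.exp(-dot(hidden, 0, syn1neg, w * dim, dim)).toFloat)
      val g = (label - f) * lr
      axpy(g, syn1neg, w * dim, grad, 0, dim)    // gradient w.r.t. the hidden layer
      axpy(g, hidden, 0, syn1neg, w * dim, dim)  // update word w's output vector
    }
  }
}
```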
I'm not sure what you mean by an "MLP" implementation. Could you please
clarify?