Github user shubhamchopra commented on the issue:

    https://github.com/apache/spark/pull/17673
  
    The [original paper](https://arxiv.org/abs/1301.3781) proposed two model 
architectures for generating word embeddings: the Continuous Skip-gram model and 
the Continuous Bag-of-Words (CBOW) model. Spark ML currently implements only the 
Skip-gram model; this PR adds the CBOW model. The two architectures are 
alternatives to each other, so this change gives users the option to pick 
whichever suits their data best.
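    To make the contrast concrete, here is a minimal, hypothetical sketch (plain 
Python, not the PR's Scala code) of how the two architectures frame their training 
examples over a context window:

```python
# Hypothetical illustration (not Spark ML code): the same window of text
# yields different (input, target) prediction tasks in each architecture.
def training_examples(tokens, window=2):
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        # Skip-gram: predict each context word from the center word.
        skipgram += [(center, c) for c in context]
        # CBOW: predict the center word from its whole context.
        cbow.append((context, center))
    return skipgram, cbow

sg, cb = training_examples(["the", "quick", "brown", "fox"], window=1)
```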
    
    The implementation is based largely on the [original C 
implementation](https://code.google.com/archive/p/word2vec/). I implemented it 
using negative sampling, which was shown to perform well 
[here](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
 and I vectorized operations with BLAS where possible.
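    As a rough illustration of a CBOW negative-sampling update, here is a toy 
NumPy sketch under assumed inputs (vocabulary indices and pre-drawn negative 
samples); it is not the PR's BLAS-based Scala code, and details such as the 
gradient sharing across context words are simplified:

```python
import numpy as np

# Toy sketch: one CBOW update with negative sampling. `in_vecs` holds the
# input (context) embeddings, `out_vecs` the output embeddings. The target
# word gets label 1, each sampled negative gets label 0.
def cbow_ns_step(in_vecs, out_vecs, context_ids, target_id, neg_ids, lr=0.025):
    h = in_vecs[context_ids].mean(axis=0)      # average the context vectors
    grad_h = np.zeros_like(h)
    for wid, label in [(target_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        score = 1.0 / (1.0 + np.exp(-np.dot(h, out_vecs[wid])))  # sigmoid
        g = (label - score) * lr               # scaled error signal
        grad_h += g * out_vecs[wid]            # accumulate gradient w.r.t. h
        out_vecs[wid] += g * h                 # update the output vector
    # Distribute the hidden-layer gradient back to the context words.
    in_vecs[context_ids] += grad_h / len(context_ids)
    return in_vecs, out_vecs
```

    Repeated applications of this step push the target word's score toward 1 
while pushing the sampled negatives toward 0.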
     
    I don't understand what you mean by an "MLP" implementation. Could you 
please clarify?


