Github user Krimit commented on the issue:
https://github.com/apache/spark/pull/17673
Thanks for the detailed response @shubhamchopra.
I'd like to clarify my point about whether this should be implemented in
Spark: Spark MLlib is first and foremost a framework for doing ML on large
datasets where other existing implementations (such as ``scikit-learn``) are
impractical. A reality of ML is that increasing the size (and quality) of the
training data is often much more important than tweaking model hyper-parameters.
Therefore, as a community, I think our focus should be more on robustness than
on "completeness".
While having additional algorithms available for tuning can be helpful, I
would personally be more interested in additions that offer significant and
clear benefits (such as ``GloVe``, which should be much faster to train and a
really good fit for Spark due to the natural parallelization of the problem).
With that said, I'm not opposed to adding CBOW, so long as we vet it. As
part of having this merged in, I think ideally we should run an experiment on a
large-ish dataset (Wikipedia?) comparing the two implementations.