Github user Krimit commented on the issue:
https://github.com/apache/spark/pull/17673
Thanks for the detailed response @shubhamchopra.
I'd like to clarify my point about whether this should be implemented in
Spark: Spark MLlib is first and foremost a framework for doing ML on large
datasets where other existing implementations (such as ``scikit-learn``) are
impractical. A reality of ML is that increasing the size (and quality) of the
training data is often much more important than tweaking model hyper-parameters.
Therefore, as a community, I think our focus should be more on robustness than
on "completeness".
While having additional algorithms available for tuning can be helpful, I
would personally be more interested in additions that offer significant and
clear benefits (such as ``GloVe``, which should be much faster to train and a
really good fit for Spark due to the natural parallelization of the problem).
With that said, I'm not opposed to adding CBOW, so long as we vet it. As
part of having this merged in, I think ideally we should run an experiment on a
large-ish dataset (Wikipedia?) comparing the two implementations.