Github user kalmanchapman commented on the issue:
https://github.com/apache/flink/pull/2735
Hey Theodore,
Thanks for taking a look at my PR!
- I'll add docs shortly, per the examples you posted.
- I've tested against datasets in the hundreds of megabytes (using the
preprocessed Wikipedia articles available
[here](http://mattmahoney.net/dc/textdata)) in a distributed, HDFS-backed
environment. The implementation scaled well as the data grew, although I ran
into some frustrating memory issues as I increased the number of iterations.
- I can show that the generated vectors give good results along the lines
of the original paper: semantically similar words have high cosine
similarity, and difference vectors can be used to form 'analogy'
relationships that make sense. But you're right that it's non-deterministic,
and surveying how it's tested in other libraries is inconclusive. I've
included some toy datasets in the integration tests that demonstrate and
exercise these qualities.
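To illustrate the kind of check the integration tests perform, here is a minimal sketch of the cosine-similarity and analogy properties. The vectors below are hand-picked toy placeholders, not output of this PR's ContextEmbedder, and the helper names are mine:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c, vocab):
    # Return the vocabulary word closest (by cosine) to b - a + c,
    # i.e. the word completing "a is to b as c is to ?".
    target = [bb - aa + cc for aa, bb, cc in zip(a, b, c)]
    return max(vocab, key=lambda w: cosine(vocab[w], target))

# Toy 2-d vectors chosen so that king - man + woman lands on queen.
vocab = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.5, 1.0],
    "woman": [0.5, -1.0],
}
best = analogy(vocab["man"], vocab["king"], vocab["woman"], vocab)
print(best)  # → queen
```

A real test would train on a small corpus and assert that such analogies rank near the top, rather than hard-coding the vectors.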
- I know what you mean about the new package. I included it because the
feature request was specifically for Word2Vec. But, similar to your
suggestion, the class in the nlp package is really just a wrapper around a
generic embedding algorithm that can operate on any data that is
word-in-sentence-like. The ContextEmbedder class, in the optimization
package, is where the actual embedding occurs.
That said, optimization might not be the right home either (although we are
optimizing toward some minimum).
Best,
Kalman