Github user thvasilo commented on the issue:
https://github.com/apache/flink/pull/2735
Thank you for your contribution, Kalman!
I just took a brief look; this is a big PR, so it will probably take some time
to review.
For now, a few things that come to mind:
* We'll need to add docs for the algorithm, and they should be example-heavy.
[Here's](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/standard_scaler.html)
a simple example for another pre-processing algorithm. I see you already have
extensive ScalaDocs; we could probably replicate those in the docs (see the
first sketch after this list).
* Have you tested it on a relatively large-scale dataset, ideally in a
distributed setting where the input files are on HDFS? That way we test the
scalability of the implementation, which is usually where problems arise
(second sketch below).
* Have you compared the output with a reference implementation? My
knowledge of word2vec is not very deep, but as far as I understand the output
is non-deterministic, so we would need some sort of proxy to evaluate the
correctness of the implementation end-to-end (third sketch below).
* Finally, I see this introduces a new nlp package. I'm not sure how to
treat this (and related algorithms, say TF-IDF), as they are not necessarily
NLP-specific: even though they stem from that field, you could treat any
sequence of objects as a "sentence" and embed them. I would favor including
them as pre-processing steps, and hence inheriting from the `Transformer`
interface, perhaps by having a feature pre-processing package (last sketch
below).
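
To make the docs point concrete, the page could open with a usage snippet in
the same style as the StandardScaler page linked above. A rough sketch (the
toy vectors are made up purely for illustration):

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.StandardScaler

object ScalerExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy data standing in for real feature vectors
    val dataSet = env.fromElements(
      DenseVector(1.0, 2.0),
      DenseVector(3.0, 4.0))

    // Configure the transformer: rescale to mean 10.0 and std 2.0
    val scaler = StandardScaler()
      .setMean(10.0)
      .setStd(2.0)

    // Learn the per-feature mean and standard deviation of the data
    scaler.fit(dataSet)

    // Apply the learned scaling and print the result
    val scaled = scaler.transform(dataSet)
    scaled.print()
  }
}
```

The word2vec page would follow the same construct / set hyper-parameters /
fit / transform pattern.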
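On the scale test, this is the kind of minimal driver I have in mind, assuming
a line-per-sentence corpus (the HDFS path and the whitespace tokenization are
placeholders):

```scala
import org.apache.flink.api.scala._

object Word2VecScaleTest {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Read a large corpus from HDFS; each line is treated as one sentence
    val sentences: DataSet[Array[String]] = env
      .readTextFile("hdfs:///corpora/some-large-corpus.txt")
      .map(_.toLowerCase.split("\\s+"))

    // Fitting the word2vec transformer on `sentences` here would exercise
    // the distributed code paths; runtime and shuffle sizes can then be
    // inspected in the Flink web dashboard.
    println(sentences.count())
  }
}
```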
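On evaluating correctness: since the raw vectors differ between runs, one
proxy is to check that nearest-neighbor lists under cosine similarity largely
agree with those of a reference implementation (e.g. Gensim's word2vec). A
sketch, assuming a `vectors` map of learned embeddings:

```scala
// Cosine similarity between two embedding vectors
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms =
    math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norms
}

// Proxy check: for a probe word, the top-k most similar words should
// largely overlap with the top-k produced by a reference implementation,
// even though the raw vectors differ between runs.
def topK(probe: String, vectors: Map[String, Array[Double]], k: Int): Seq[String] =
  vectors
    .collect { case (w, v) if w != probe => (w, cosine(vectors(probe), v)) }
    .toSeq
    .sortBy(-_._2)
    .take(k)
    .map(_._1)
```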
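And to make the `Transformer` suggestion concrete, here is a hypothetical
skeleton wired through FlinkML's pipeline type classes; the class name,
package placement, and input/output types are my assumptions, not the PR's
actual code:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.ParameterMap
import org.apache.flink.ml.pipeline.{FitOperation, TransformDataSetOperation, Transformer}

// Hypothetical: the embedding as a generic feature pre-processing step
class SequenceEmbedder extends Transformer[SequenceEmbedder] {
  // learned state (the embedding vectors) would be stored here after fit()
}

object SequenceEmbedder {
  def apply(): SequenceEmbedder = new SequenceEmbedder

  // FlinkML binds fit() through an implicit type-class instance
  implicit val fitEmbedder: FitOperation[SequenceEmbedder, Array[String]] =
    new FitOperation[SequenceEmbedder, Array[String]] {
      override def fit(
          instance: SequenceEmbedder,
          fitParameters: ParameterMap,
          input: DataSet[Array[String]]): Unit = {
        // learn embedding vectors from the input "sentences"
      }
    }

  // ...and transform() through TransformDataSetOperation, mapping each
  // token sequence to e.g. an averaged embedding vector
  implicit val transformEmbedder: TransformDataSetOperation[SequenceEmbedder, Array[String], Array[Double]] =
    new TransformDataSetOperation[SequenceEmbedder, Array[String], Array[Double]] {
      override def transformDataSet(
          instance: SequenceEmbedder,
          transformParameters: ParameterMap,
          input: DataSet[Array[String]]): DataSet[Array[Double]] = {
        input.map(_ => Array.empty[Double]) // placeholder
      }
    }
}
```

Usage would then mirror the other pre-processors: construct, `fit` on the
corpus, `transform` to get features, and the step could be chained in a
pipeline like any other `Transformer`.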
Regards,
Theodore