[
https://issues.apache.org/jira/browse/FLINK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633492#comment-15633492
]
ASF GitHub Bot commented on FLINK-2094:
---------------------------------------
Github user thvasilo commented on the issue:
https://github.com/apache/flink/pull/2735
Thank you for your contribution Kalman!
I just took a brief look, this is a big PR so will probably take some time
to review.
For now a few things that jump to mind:
* We'll need to add docs for the algorithm, which should be example heavy.
[Here's](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/standard_scaler.html)
a simple example for another pre-processing algorithm. I see you already have
extensive ScalaDoc's we could prolly replicate those in the docs.
* Have you tested it in a relatively large scale dataset? Ideally in a
distributed setting where the input files are on HDFS. This way we test the
scalability of the implementation, and problems usually arise.
* Have you compared the output with a reference implementation? My
knowledge of word2vec is not very deep but as far as I understand the output is
non-deterministic, so we would need some sort of proxy to evaluate the
integrated correctness of the implementation.
* Finally I see this introduces a new nlp package. I'm not sure how to
treat this (and relevant algorithms, say TF-IDF), as they are not necessarily
NLP specific, even though they stem from the field you could treat any sequence
of objects as a "sentence" and embed them. I would favor including them as
pre-processing steps and hence inheriting from the `Transformer` interface,
perhaps by having a feature pre-processing package.
Regards,
Theodore
> Implement Word2Vec
> ------------------
>
> Key: FLINK-2094
> URL: https://issues.apache.org/jira/browse/FLINK-2094
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Nikolaas Steenbergen
> Assignee: Nikolaas Steenbergen
> Priority: Minor
> Labels: ML
>
> implement Word2Vec
> http://arxiv.org/pdf/1402.3722v1.pdf
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)