Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/10152#issuecomment-162526464
Pinging @mengxr @MechCoder @jkbradley (and I think @Ishiihara was the
original author of Word2Vec?)
So let's focus this PR on making the max sentence size configurable, if
that is desirable?
Looking a bit deeper, the sentence structure of the input is essentially
discarded in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L273.
This dates back to the original implementation, and it does match the original
Google implementation that treats end-of-line as a word boundary, then imposes
a `MAX_SENTENCE_LENGTH` of 1000 when processing the word stream.
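For reference, the current behavior can be sketched roughly like this (a simplified illustration of the chunking logic, not the actual MLlib code — the object and method names here are made up):

```scala
// Simplified sketch (not the actual MLlib code): flatten all input
// "sentences" into one word stream, then cut that stream into chunks
// of at most maxSentenceLength words, discarding original boundaries.
object SentenceChunking {
  // hard-coded as MAX_SENTENCE_LENGTH = 1000 in the current implementation
  val maxSentenceLength = 1000

  def chunk(corpus: Iterator[Seq[String]]): Iterator[Seq[String]] =
    corpus.flatten.grouped(maxSentenceLength)

  def main(args: Array[String]): Unit = {
    // two input "sentences" of 1500 and 700 words become
    // chunks of 1000, 1000 and 200 words
    val corpus = Iterator(Seq.fill(1500)("w"), Seq.fill(700)("w"))
    println(chunk(corpus).map(_.size).toList)
  }
}
```

So a 1500-word input line gets split at the 1000-word mark, and the remainder is merged with the start of the next line — the input's sentence boundaries are gone.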
It's interesting to note that e.g. Gensim's implementation respects the
sentence structure of the input data
(https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L120).
Deeplearning4j seems to do the same.
Thinking about it, it does seem a little strange to me to discard sentence
boundaries. It makes sense for very large text corpora. But Word2Vec is
more general than that, and can be applied e.g. in recommendation settings,
where the boundary between "sentences" (say, user activity histories) is
more patently discontinuous.
Thoughts? On the face of it we can leave the implementation as is (as it is
true to the original), optionally making the max sentence length a configurable
param. Or we can look at using the "sentence" structure of the input data
(perhaps making the behaviour configurable between this and the original impl).
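A rough sketch of what the configurable alternative could look like (the method and parameter names here are hypothetical, purely for illustration — not a concrete API proposal):

```scala
// Hypothetical sketch: a flag choosing between the current flattening
// behavior and respecting the input's sentence boundaries.
object ConfigurableSentences {
  def toSentences(corpus: Seq[Seq[String]],
                  maxSentenceLength: Int,
                  respectBoundaries: Boolean): Seq[Seq[String]] =
    if (respectBoundaries)
      // keep each input record as its own sentence,
      // still capping overlong records at maxSentenceLength
      corpus.flatMap(_.grouped(maxSentenceLength))
    else
      // current behavior: discard boundaries, cut the flat word stream
      corpus.flatten.grouped(maxSentenceLength).toSeq
}
```

With `respectBoundaries = true`, short input records stay as separate sentences instead of being merged into the flat stream, which matters exactly in the "user activity history" case above.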