gggsanna commented on pull request #5300: URL: https://github.com/apache/spark/pull/5300#issuecomment-638472045
Sorry, I see that this has not been fixed yet. I'm not fluent in Scala, but I see at least two things causing non-determinism.

First, at line 347 there is a `repartition`: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L347 (the same line as in the original issue). Replacing it with `repartitionAndSortWithinPartitions` would help make the fit deterministic without the cost of sorting the whole RDD; see the first sketch below.

Second, when the vocabulary is learned in the `learnVocab` method (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L198), the dataset of word-occurrence counts is collected and then sorted by count. This is also non-deterministic, since there are many ties and they are broken according to the collect order. Because this sort determines the integer index associated with each word, it alone is enough to make the repartition at line 347 (with or without sorting) non-deterministic; see the second sketch below.
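For illustration, here is a minimal sketch of the first point. It assumes `sentences` is the `RDD[Array[Int]]` of word indices that `fit` builds and that `numPartitions` is the existing training parameter; keying each sentence by its own contents is a hypothetical choice on my part, not something in the current code:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical replacement for the plain repartition at Word2Vec.scala:347.
def deterministicRepartition(
    sentences: RDD[Array[Int]],
    numPartitions: Int): RDD[Array[Int]] = {
  sentences
    // Key each sentence by a deterministic encoding of its contents, so the
    // target partition and the within-partition position of a sentence do
    // not depend on the incoming partition layout.
    .map(s => (s.mkString(","), s))
    // Shuffle to numPartitions and sort each partition by key during the
    // shuffle; cheaper than a global sortBy because no total order across
    // partitions is established.
    .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
    .values
}
```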

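And a sketch of a deterministic tie-break for the vocabulary sort in `learnVocab`. The helper name and the plain `(word, count)` pairs are hypothetical (if I read the code correctly, the real method builds `VocabWord` objects), but the idea is simply to add the word itself as a secondary sort key:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: count words and sort with a deterministic tie-break.
def deterministicVocab(
    dataset: RDD[Iterable[String]],
    minCount: Int): Array[(String, Int)] = {
  dataset
    .flatMap(words => words)
    .map(w => (w, 1))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count >= minCount }
    .collect()
    // Sort by count descending, breaking ties by the word itself, so the
    // integer index assigned to each word is the same on every run.
    .sortBy { case (word, count) => (-count, word) }
}
```

Thanks!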