gggsanna commented on pull request #5300: URL: https://github.com/apache/spark/pull/5300#issuecomment-638472045
Sorry, I see that this has not been fixed yet. I'm not fluent in Scala, but I see at least two things causing non-determinism.

First, at line 347 there is a `repartition`: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L347 (the same line as in the original issue). Replacing it with `repartitionAndSortWithinPartitions` would help make the fit deterministic without the cost of sorting the whole RDD; see the first sketch below.

Second, when the vocabulary is learned in the `learnVocab` method (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L198), the dataset of word-occurrence counts is collected and then sorted by count. This is also non-deterministic, since there are many ties and they are broken according to the collect order. Because this sort determines the integer index associated with each word, it alone is enough to make the repartition at line 347 (with or without sorting) non-deterministic; see the second sketch below.
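For illustration, here is a minimal sketch of the first point. It assumes `sentences` is the `RDD[Array[Int]]` of word indices that `fit` builds and that `numPartitions` is the existing training parameter; keying each sentence by its own contents is a hypothetical choice on my part, not something in the current code:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical replacement for the plain repartition at Word2Vec.scala:347.
def deterministicRepartition(
    sentences: RDD[Array[Int]],
    numPartitions: Int): RDD[Array[Int]] = {
  sentences
    // Key each sentence by a deterministic encoding of its contents, so the
    // target partition and the within-partition position of a sentence do
    // not depend on the incoming partition layout.
    .map(s => (s.mkString(","), s))
    // Shuffle to numPartitions and sort each partition by key during the
    // shuffle; cheaper than a global sortBy because no total order across
    // partitions is established.
    .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
    .values
}
```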

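And a sketch of a deterministic tie-break for the vocabulary sort in `learnVocab`. The helper name and the plain `(word, count)` pairs are hypothetical (if I read the code correctly, the real method builds `VocabWord` objects), but the idea is simply to add the word itself as a secondary sort key:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: count words and sort with a deterministic tie-break.
def deterministicVocab(
    dataset: RDD[Iterable[String]],
    minCount: Int): Array[(String, Int)] = {
  dataset
    .flatMap(words => words)
    .map(w => (w, 1))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count >= minCount }
    .collect()
    // Sort by count descending, breaking ties by the word itself, so the
    // integer index assigned to each word is the same on every run.
    .sortBy { case (word, count) => (-count, word) }
}
```

Thanks!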