Github user LowikC commented on the issue: https://github.com/apache/spark/pull/19372

I think the PR is incorrect:

- The original C code decreases the learning rate linearly from `starting_alpha` to 0 across all iterations:

  ```
  new_alpha = starting_alpha * (1 - progress)
  progress = word_count_actual / (numIterations * numWordsPerIteration + 1)
  ```

  Note that `word_count_actual` counts the number of words processed so far, across all iterations.
- In the PR, `word_count_actual` is called `wordCount`, but `wordCount` is reset at the beginning of each iteration (see https://github.com/nzw0301/spark/blob/e2a7d393e141405f658a68f99bc4a1f53816db95/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L365).
- The current Spark code is also different: it decreases the learning rate linearly from `learningRate` to 0 within a single iteration, and resets it to `learningRate` at the beginning of the next iteration.
- The "right" formula (to match the behavior of the original C code) would be:

  ```scala
  val numWordsToProcess = numIterations * trainWordsCount + 1
  val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
  val numWordsProcessedInCurrentIteration = numPartitions * wordCount.toDouble
  val progress = (numWordsProcessedInPreviousIterations + numWordsProcessedInCurrentIteration) / numWordsToProcess
  alpha = learningRate * (1 - progress)
  ```
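To make the difference between the two schedules concrete, here is a small self-contained sketch of both decay formulas. The object and parameter names are illustrative, not taken from the Spark source; it only assumes the formulas quoted above.

```scala
object AlphaSchedules {
  // Global linear decay, as in the original C word2vec:
  // alpha falls from startingAlpha toward 0 across ALL iterations.
  def globalAlpha(startingAlpha: Double,
                  numIterations: Int,
                  wordsPerIteration: Long,
                  iteration: Int,          // 1-based, like `k` above (hypothetical name)
                  wordsSeenThisIter: Long): Double = {
    val totalWords = numIterations.toLong * wordsPerIteration + 1
    val wordsSoFar = (iteration - 1).toLong * wordsPerIteration + wordsSeenThisIter
    startingAlpha * (1.0 - wordsSoFar.toDouble / totalWords)
  }

  // Per-iteration decay, matching the current Spark behavior described above:
  // alpha falls from startingAlpha toward 0 within EACH iteration,
  // then resets to startingAlpha at the start of the next one.
  def perIterationAlpha(startingAlpha: Double,
                        wordsPerIteration: Long,
                        wordsSeenThisIter: Long): Double =
    startingAlpha * (1.0 - wordsSeenThisIter.toDouble / (wordsPerIteration + 1))

  def main(args: Array[String]): Unit = {
    val start = 0.025
    // Midway through iteration 2 of 3 (1000 words per iteration):
    // global decay has already consumed 1500 of ~3000 words, so alpha is near start/2,
    // while per-iteration decay is also near start/2 here but will jump back to
    // `start` when iteration 3 begins.
    println(globalAlpha(start, 3, 1000L, 2, 500L))
    println(perIterationAlpha(start, 1000L, 500L))
    // At the START of iteration 3 the schedules diverge sharply:
    println(globalAlpha(start, 3, 1000L, 3, 0L))   // ~start/3, still decaying
    println(perIterationAlpha(start, 1000L, 0L))   // reset to start
  }
}
```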