viirya commented on issue #26722: [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large
URL: https://github.com/apache/spark/pull/26722#issuecomment-560187625

That's a good point! I checked the alpha value during fitting with 5 partitions. At the end of fitting, alpha becomes a very small value like 3.131027051017643E-6. I think the current alpha value is also not correctly computed.

Originally, alpha is updated like https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L397:

```
alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1));
```

In Spark's Word2Vec, it is updated as:

```
alpha = learningRate * (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) / totalWordsCounts)
```

Here, because the partition-local word count is multiplied by numPartitions, we may update alpha to a significantly negative value.
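To illustrate the concern, here is a minimal, self-contained Scala sketch with hypothetical counts (not values taken from an actual fit) showing how the Spark-style update can push alpha below zero once numPartitions multiplies the partition-local word count, while the original word2vec.c decays alpha and clamps it to starting_alpha * 0.0001, so it never goes negative:

```
object AlphaUpdateSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical values for illustration only.
    val learningRate = 0.025
    val numPartitions = 5
    val totalWordsCounts = 1000000L
    val numWordsProcessedInPreviousIterations = 900000L
    val wordCount = 50000L // words seen so far in this partition

    // Spark-style update: numPartitions scales the partition-local count,
    // so the numerator can exceed totalWordsCounts and alpha becomes negative.
    val sparkAlpha = learningRate *
      (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) / totalWordsCounts)
    println(f"Spark-style alpha: $sparkAlpha%.6f") // -0.003750 with these numbers

    // word2vec.c-style behaviour: alpha decays but is clamped to a small floor
    // (starting_alpha * 0.0001), so it stays positive.
    val cStyleAlpha = math.max(
      learningRate * (1 - (wordCount.toDouble + numWordsProcessedInPreviousIterations) / totalWordsCounts),
      learningRate * 0.0001)
    println(f"C-style clamped alpha: $cStyleAlpha%.6f") // 0.001250 with these numbers
  }
}
```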
