srowen commented on issue #26722: [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large
URL: https://github.com/apache/spark/pull/26722#issuecomment-560191791

Hm, that value isn't negative though, just very small. The next line, perhaps accidentally, would handle negative values: `if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001`

`starting_alpha` is `learningRate` in the Spark code, so that much looks OK. But yes, the update rule looks different. `numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations` is a heuristic to estimate the total number of words processed across all partitions; the denominator looks like it's the same, though. I would guess the decayed alpha could go negative here, but then the clamp above just picks a very small alpha, which is "OK". I'm not sure that's the issue. (See the first sketch below.)

I'm kind of wondering about this line in the C code: https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L448 I don't quite see its equivalent here. `syn0` is basically used for `neu1`, but it's missing some normalization by `cw`, which is, I believe, `2 * windowSize + 1 - 2 * b` here. That's up to a factor of about 9 if windowSize is 4. That feeds, I think, directly into the size of `g`, as it makes the magnitude of the dot product that feeds `f` a lot larger.

What I don't really understand is why it would be 'triggered' by the number of partitions rather than iterations here, or why it doesn't seem to show up otherwise. It's possible that it's really iterations driving this, and numPartitions isn't 'helping'. Hm, what about sticking in a normalization by that factor as a hack to see what happens? (See the second sketch below.)
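To make the alpha point above concrete, here is a minimal, self-contained sketch. It is not the actual Spark code: the constants are made up, and the decay formula only approximates the shape of the update discussed above, but it shows how the estimated-words-processed heuristic can overshoot the total, push alpha negative, and then get rescued by the clamp:

```scala
object AlphaDecaySketch {
  def main(args: Array[String]): Unit = {
    val learningRate = 0.025        // Spark's default Word2Vec step size
    val numPartitions = 10          // hypothetical
    val trainWordsCount = 100000L   // hypothetical corpus size
    val totalWords = trainWordsCount + 1

    // Words seen by this partition so far (hypothetical). Scaling by
    // numPartitions to estimate global progress can overshoot the total.
    val wordCount = 11000L

    var alpha = learningRate * (1 - numPartitions * wordCount.toDouble / totalWords)
    println(f"decayed alpha = $alpha%.6f") // -0.002500: the estimate overshot

    // The clamp quoted above turns any negative or tiny value back into a
    // very small positive step size.
    if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
    println(f"clamped alpha = $alpha%.7f") // 0.0000025
  }
}
```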
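And on the `cw` point: in the C code's CBOW branch, `neu1` is divided by `cw` (the number of context words) before the dot product that produces `f`, so `f` stays bounded regardless of window size; summing without that division inflates `f`, and hence `g`, by roughly the factor above. A toy sketch of just that effect (hypothetical values, not Spark internals):

```scala
object CwNormalizationSketch {
  def main(args: Array[String]): Unit = {
    val vectorSize = 100
    val windowSize = 4
    val b = 0                              // random window shrink; 0 = full window
    val cw = 2 * windowSize + 1 - 2 * b    // the factor cited above: 9 here

    // Hypothetical context vectors and output-layer vector, all entries 0.1.
    val context = Array.fill(cw)(Array.fill(vectorSize)(0.1f))
    val syn1 = Array.fill(vectorSize)(0.1f)

    // Plain sum of the context vectors, with no normalization.
    val neu1Sum = new Array[Float](vectorSize)
    for (v <- context; i <- 0 until vectorSize) neu1Sum(i) += v(i)

    // The C code's version: neu1[c] /= cw (word2vec.c line 448).
    val neu1Avg = neu1Sum.map(_ / cw)

    def dot(a: Array[Float], b: Array[Float]): Float =
      a.indices.map(i => a(i) * b(i)).sum

    // The unnormalized dot product is cw times larger, which is what blows up
    // the magnitude of f, saturates the sigmoid, and inflates g downstream.
    println(s"f without /cw = ${dot(neu1Sum, syn1)}") // ~9.0
    println(s"f with /cw    = ${dot(neu1Avg, syn1)}") // ~1.0
  }
}
```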
