srowen commented on issue #26722: [SPARK-24666][ML] Fix infinity vectors 
produced by Word2Vec when numIterations are large
URL: https://github.com/apache/spark/pull/26722#issuecomment-560191791
 
 
   Hm, that value isn't negative though, just very small. The next line, 
perhaps accidentally, would handle negative values: `if (alpha < learningRate * 
0.0001) alpha = learningRate * 0.0001`. `starting_alpha` is `learningRate` in 
the Spark code, so that much looks OK.
   
   But yes the update rule looks different. `numPartitions * wordCount.toDouble 
+ numWordsProcessedInPreviousIterations` is a heuristic to estimate the total 
number of words processed by all partitions. The denominator looks like it's 
the same though. I would guess the estimate could overshoot and make alpha 
negative, but then falling back to a very small alpha is "OK". I'm not sure 
that's the issue.
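
To make the update rule above concrete, here is a minimal standalone sketch in Python (not the Spark code itself). The function name and the `total_words` parameter are hypothetical; `total_words` stands in for the denominator, which the comment only says looks the same as in the C code. The sketch shows how the heuristic numerator can exceed the denominator and how the floor check then takes over.

```python
def decayed_alpha(learning_rate, num_partitions, word_count,
                  words_prev_iters, total_words):
    """Sketch of the learning-rate decay discussed above.

    total_words is an assumed name for the denominator; it is not
    taken from the Spark source.
    """
    # Heuristic estimate of words processed by all partitions so far.
    processed = num_partitions * float(word_count) + words_prev_iters
    alpha = learning_rate * (1.0 - processed / total_words)
    # The floor check also catches a negative alpha when the estimate
    # overshoots total_words.
    floor = learning_rate * 0.0001
    if alpha < floor:
        alpha = floor
    return alpha


# Normal case: the estimate is below the total, so alpha decays linearly.
normal = decayed_alpha(0.025, 4, 1000, 0, 10000)
# Overshoot case: 4 * 3000 > 10000, so the raw value would be negative.
clamped = decayed_alpha(0.025, 4, 3000, 0, 10000)
```

In the overshoot case the raw value is negative, and the result is the floor `learning_rate * 0.0001`, which matches the "very small alpha" outcome described above.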
   
   I'm kind of wondering about this line in the C code:
   https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L448
   
   I don't quite see its equivalent here. `syn0` is basically used for `neu1`, 
but it's missing some normalization by `cw`, which I believe is `2 * windowSize 
+ 1 - 2 * b` here. That's up to a factor of about 9 if windowSize is 4. That 
feeds, I think, directly into the size of `g` as it makes the magnitude of the 
dot product that feeds `f` a lot larger.
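
The effect of the missing normalization can be sketched in a few lines of Python. This is an illustration under assumptions, not the C or Spark implementation: the helper names are hypothetical, `cw` is simply taken as the number of context vectors, and the vectors are toy values. It compares the dot product and the resulting `g` when `neu1` is a plain sum versus a sum divided by `cw`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layer(context_vectors, normalize):
    """Sum the context vectors; optionally divide by cw (their count)."""
    dim = len(context_vectors[0])
    neu1 = [sum(v[i] for v in context_vectors) for i in range(dim)]
    if normalize:
        cw = len(context_vectors)
        neu1 = [x / cw for x in neu1]
    return neu1


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gradient(neu1, output_vector, label, alpha):
    """g = (label - f) * alpha, with f the sigmoid of the dot product."""
    f = sigmoid(dot(neu1, output_vector))
    return (label - f) * alpha


# Toy data (assumed values): 8 identical context vectors, one output vector.
context = [[0.5, 0.5] for _ in range(8)]
output_vector = [1.0, 1.0]

summed = hidden_layer(context, normalize=False)
averaged = hidden_layer(context, normalize=True)

dot_summed = dot(summed, output_vector)
dot_averaged = dot(averaged, output_vector)

g_summed = gradient(summed, output_vector, label=0, alpha=0.025)
g_averaged = gradient(averaged, output_vector, label=0, alpha=0.025)
```

Without the division, the dot product is `cw` times larger, which pushes `f` toward saturation and changes the size of `g` for the same inputs.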
   
   What I don't really understand is why it would be 'triggered' by the number 
of partitions rather than iterations here, or why it doesn't seem to show up 
otherwise. It's possible that it's really iterations driving this, and 
numPartitions isn't 'helping'.
   
   Hm, what about sticking in a normalization by that factor above as a hack to 
see what happens?
