srowen commented on issue #26722: [SPARK-24666][ML] Fix infinity vectors 
produced by Word2Vec when numIterations are large
URL: https://github.com/apache/spark/pull/26722#issuecomment-560012547
 
 
   Hm, I don't think cosine similarity is the only thing that matters; these vectors are often used as general-purpose embeddings for neural nets and the like. Normalizing the vectors individually changes their relative magnitudes. That doesn't make the answer wrong per se, but it's not clear that losing that information is free, in training too.
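   To make the point concrete, here is a minimal sketch in plain Scala (no Spark APIs; all names are made up for illustration): per-vector L2 normalization leaves cosine similarity untouched, but it erases the relative magnitudes between vectors.

```scala
object NormDemo {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Scale a single vector to unit length, as per-vector normalization would.
  def normalize(a: Array[Double]): Array[Double] = {
    val n = norm(a)
    a.map(_ / n)
  }

  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  def main(args: Array[String]): Unit = {
    val u = Array(3.0, 4.0) // norm 5.0
    val v = Array(0.3, 0.4) // norm 0.5, same direction

    // Cosine similarity is scale-invariant: 1.0 both before and after.
    println(cosine(u, v))
    println(cosine(normalize(u), normalize(v)))

    // But the 10x magnitude ratio between u and v is gone afterwards.
    println(norm(u) / norm(v))                       // 10.0
    println(norm(normalize(u)) / norm(normalize(v))) // 1.0
  }
}
```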
   
   I do think it's more valid to divide everything through by a single constant, which could arbitrarily be the number of partitions or the number of words. I'd love to find a more authoritative reference for that kind of approach.
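   For contrast, a sketch of that alternative (again plain Scala; the particular constant below is hypothetical, not what this PR implements): dividing every vector by one shared constant bounds the values while keeping all relative magnitudes, and hence cosine similarities, intact.

```scala
object GlobalScaleDemo {
  def norm(a: Array[Double]): Double = math.sqrt(a.map(x => x * x).sum)

  def main(args: Array[String]): Unit = {
    val vectors = Map(
      "cat" -> Array(30.0, 40.0),
      "dog" -> Array(3.0, 4.0)
    )
    // One shared constant for every vector; tying it to the number of
    // partitions (or words) is the arbitrary suggestion above, not a fix
    // taken from the actual code.
    val c = 8.0

    val scaled = vectors.map { case (w, v) => w -> v.map(_ / c) }

    // Relative magnitude is preserved: both lines print 10.0.
    println(norm(vectors("cat")) / norm(vectors("dog")))
    println(norm(scaled("cat")) / norm(scaled("dog")))
  }
}
```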
   
   I haven't looked at the impl in a long time, but I'm also trying to figure out why this happens in the first place. What in the code makes the magnitudes blow up rather than scale with the size of the input? Conceptually, they should not.
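   One guess at the mechanism, written as a toy model rather than a reading of the actual Word2Vec code: if each iteration sums per-partition copies of the weights instead of averaging them, magnitudes grow by a factor of numPartitions per iteration, independent of the input size, which would explain the Infinity values.

```scala
object BlowUpToy {
  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    var weight = 1.0 // stands in for one entry of the word-vector matrix

    for (iter <- 1 to 20) {
      // Suppose every partition starts from the same global weight and the
      // driver sums the copies back together rather than averaging them.
      weight = (1 to numPartitions).map(_ => weight).sum
      println(s"iteration $iter: $weight")
    }
    // After 20 iterations this is 4^20 ~= 1.1e12; by around iteration 64 a
    // Float would already have overflowed to Infinity.
  }
}
```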
