Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15105#discussion_r79005872
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
         wordList.zip(cosVec)
           .toSeq
           .sortBy(-_._2)
    -      .take(num + 1)
    -      .tail
    +      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
    --- End diff --
    
    Consider the case where you're passing in a vector that is extremely 
similar to the representation of a word in the vocabulary but that is not 
itself the representation of a word in the vocabulary.  (A concrete example is 
(in this test I 
added)[https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72].)
  In this case, `wordOpt` is not defined (because you are querying for words 
whose vector representations are similar to a given vector) but you nonetheless 
are not interested in discarding the best match because it is not a trivial 
match (that is, it is not going to be your query term).
    
    Related to your other comment (and discussion elsewhere on this PR), I 
think we could make a case for changing the documented behavior (especially 
since it is only documented as such in pyspark) in the case where 
`findSynonyms` is invoked with the vector representation of a word that is _in_ 
the vocabulary.  Instead of rejecting the best match in that case, we could 
return it. The argument there is that such a result is telling you something 
you didn't necessarily know (otherwise, you'd probably be querying for a word 
and not a vector) and that there is an easy way to identify that such a match 
is trivially the best match.  I recognize that changing documented behavior is 
a big deal, but in this case it seems like it could be the best way to address 
a minor bug.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to