Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79005872 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is (in this test I added)[https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72].) In this case, `wordOpt` is not defined (because you are querying for words whose vector representations are similar to a given vector) but you nonetheless are not interested in discarding the best match because it is not a trivial match (that is, it is not going to be your query term). Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where `findSynonyms` is invoked with the vector representation of a word that is _in_ the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org