Hey all,

I would like to propose changing the return type of `findSynonyms` in ml's
Word2Vec
<https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>:

def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(wordVectors.findSynonyms(word, num))
    .toDF("word", "similarity")
}

I find it very strange that the results are parallelized before being
returned to the user. The results are already on the driver to begin with,
and I can imagine that for most use cases (and definitely for ours) the
synonyms are collected right back to the driver. This incurs the added cost
of shipping data to and from the cluster, and it makes the interface more
cumbersome than it needs to be.
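
To make the round trip concrete, here is a rough sketch of what a caller ends up writing today. The object name, the `synonymsOnDriver` and `fitToyModel` helpers, the toy corpus, and the local master setting are all made up for illustration; only `findSynonyms` and the `word` / `similarity` column names come from the code above.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

object SynonymRoundTrip {
  // Hypothetical helper: unwraps the returned DataFrame into driver-side tuples.
  def synonymsOnDriver(model: Word2VecModel, word: String, num: Int): Array[(String, Double)] =
    model.findSynonyms(word, num)
      .collect() // pulls the rows back to the driver
      .map(row => (row.getAs[String]("word"), row.getAs[Double]("similarity")))

  // Fits a throwaway model on a made-up corpus so the sketch can be run locally.
  def fitToyModel(spark: SparkSession): Word2VecModel = {
    val docs = spark.createDataFrame(Seq(
      "spark runs on the cluster",
      "the driver runs the spark job",
      "the cluster runs the job"
    ).map(s => Tuple1(s.split(" ").toSeq))).toDF("text")

    new Word2Vec()
      .setInputCol("text")
      .setOutputCol("vec")
      .setVectorSize(8)
      .setMinCount(1) // keep every word of the tiny corpus in the vocabulary
      .setSeed(42L)
      .fit(docs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("synonym-round-trip")
      .getOrCreate()
    val model = fitToyModel(spark)
    synonymsOnDriver(model, "spark", 2).foreach { case (w, sim) => println(s"$w\t$sim") }
    spark.stop()
  }
}
```

The `collect()` plus `map` step exists only to undo the wrapping that `findSynonyms` applied a moment earlier.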

Can we change it to just the following?

def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
  wordVectors.findSynonyms(word, num)
}

If the user wants the results parallelized, they can still do so on their
own.
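
As a sketch of what "doing so on their own" could look like, assuming `findSynonyms` returned `Array[(String, Double)]` as proposed: the object name, the `toDataFrame` helper, and the sample values below are hypothetical, and the body is just the wrapping the current method performs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParallelizeSynonyms {
  // Hypothetical helper: the same wrapping the current method does, but opt-in.
  def toDataFrame(spark: SparkSession, synonyms: Array[(String, Double)]): DataFrame =
    spark.createDataFrame(synonyms.toSeq).toDF("word", "similarity")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parallelize-synonyms")
      .getOrCreate()
    // Placeholder values standing in for the result of the proposed findSynonyms.
    val synonyms = Array(("cluster", 0.5), ("driver", 0.25))
    toDataFrame(spark, synonyms).show()
    spark.stop()
  }
}
```

That is one line on the caller's side for the minority who want a DataFrame, versus a collect-and-map for everyone else today.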

(I had brought this up a while back in Jira. It was suggested that the
mailing list would be a better forum to discuss it, so here we are.)

Thanks,
-- 
Asher Krim
Senior Software Engineer
