Asher Krim created SPARK-17629:
----------------------------------

             Summary: Should ml Word2Vec findSynonyms match the mllib implementation?
                 Key: SPARK-17629
                 URL: https://issues.apache.org/jira/browse/SPARK-17629
             Project: Spark
          Issue Type: Question
            Reporter: Asher Krim
            Priority: Minor


ml Word2Vec's findSynonyms methods depart from mllib in that they return a 
distributed DataFrame, rather than a local Array[(String, Double)] of results:

{code}
  def findSynonyms(word: String, num: Int): DataFrame = {
    val spark = SparkSession.builder().getOrCreate()
    spark.createDataFrame(wordVectors.findSynonyms(word, num))
      .toDF("word", "similarity")
  }
{code}

What was the reason for this decision? I would think that most users request a 
reasonably small number of results and want to use them directly on the 
driver, similar to the _take_ method on DataFrames. The synonyms are already 
computed as a local array on the driver, so parallelizing them into a 
DataFrame, only for the caller to collect them again, is a costly round trip 
that doesn't seem necessary.
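
To illustrate the round trip, here is a minimal sketch of what a caller has to do today to get the synonyms back as local values. The toy corpus, the object name, the column names "text" and "result", and the parameter values are all made up for illustration; only the Word2Vec and DataFrame calls themselves are part of the Spark API.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

// Hypothetical driver program; names and data are illustrative only.
object FindSynonymsRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("findSynonyms-round-trip")
      .getOrCreate()

    // Tiny made-up corpus: one tokenized sentence per row.
    val docs = spark.createDataFrame(Seq(
      "spark returns a dataframe of synonyms".split(" "),
      "mllib returns a local array of synonyms".split(" "),
      "users want synonyms on the driver".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(8)
      .setMinCount(0) // keep every token so the query word is in the vocabulary
      .fit(docs)

    // ml API: the result is a DataFrame, so it must be collected
    // back to the driver before it can be used as plain values.
    val synonyms: Array[(String, Double)] = model
      .findSynonyms("synonyms", 3)
      .collect()
      .map(row => (row.getAs[String]("word"), row.getAs[Double]("similarity")))

    synonyms.foreach { case (word, similarity) => println(s"$word\t$similarity") }

    spark.stop()
  }
}
```

The collect() and the row unpacking are exactly the extra steps that the mllib version avoids by returning the array directly.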

The original PR: https://github.com/apache/spark/pull/7263
[~MechCoder] - do you perhaps recall the reason?



