[jira] [Created] (SPARK-17629) Should ml Word2Vec findSynonyms match the mllib implementation?

Asher Krim (JIRA) Wed, 21 Sep 2016 18:56:45 -0700

Asher Krim created SPARK-17629:
----------------------------------

             Summary: Should ml Word2Vec findSynonyms match the mllib implementation?
                 Key: SPARK-17629
                 URL: https://issues.apache.org/jira/browse/SPARK-17629
             Project: Spark
          Issue Type: Question
            Reporter: Asher Krim
            Priority: Minor



ml Word2Vec's findSynonyms methods depart from mllib in that they return 
distributed results (a DataFrame), rather than returning the results directly 
as a local Array[(String, Double)]:

{code}
  def findSynonyms(word: String, num: Int): DataFrame = {
    val spark = SparkSession.builder().getOrCreate()
    spark.createDataFrame(wordVectors.findSynonyms(word, num))
      .toDF("word", "similarity")
  }
{code}

What was the reason for this decision? I would expect most users to request a 
reasonably small number of results and to use them directly on the driver, 
much like the _take_ method on DataFrames. As the snippet above shows, the 
synonyms are already computed as a local collection on the driver, so wrapping 
them in a DataFrame forces a costly round trip (parallelize, then collect) 
that doesn't seem necessary.
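
To make the difference concrete, here is a minimal sketch of how a caller ends 
up with local results from each API. It assumes Spark 2.0 running in local 
mode; the toy corpus, the column names "text" and "vec", the object name, and 
the parameter values are made up for illustration and are not taken from the 
PR.

{code}
import org.apache.spark.ml.feature.{Word2Vec => MLWord2Vec}
import org.apache.spark.mllib.feature.{Word2Vec => MLlibWord2Vec}
import org.apache.spark.sql.SparkSession

// Hypothetical driver program, for illustration only.
object FindSynonymsComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("findSynonyms-comparison")
      .getOrCreate()

    // Toy corpus (an assumption): each sentence is a sequence of tokens.
    val sentences: Seq[Seq[String]] = Seq(
      "spark ml returns a dataframe",
      "spark mllib returns an array",
      "the driver collects the dataframe"
    ).map(_.split(" ").toSeq)

    // ml: findSynonyms returns a DataFrame, so the caller has to collect()
    // and unpack Rows to get the values back onto the driver.
    val mlModel = new MLWord2Vec()
      .setInputCol("text")
      .setOutputCol("vec")
      .setVectorSize(10)
      .setMinCount(0)
      .fit(spark.createDataFrame(sentences.map(Tuple1.apply)).toDF("text"))
    val mlSynonyms: Array[(String, Double)] = mlModel
      .findSynonyms("spark", 2)
      .collect()
      .map(row => (row.getString(0), row.getDouble(1)))

    // mllib: findSynonyms already returns a local Array[(String, Double)].
    val mllibModel = new MLlibWord2Vec()
      .setVectorSize(10)
      .setMinCount(0)
      .fit(spark.sparkContext.parallelize(sentences))
    val mllibSynonyms: Array[(String, Double)] =
      mllibModel.findSynonyms("spark", 2)

    mlSynonyms.foreach(println)
    mllibSynonyms.foreach(println)
    spark.stop()
  }
}
{code}

In the ml case the extra collect() and Row unpacking is the round trip in 
question: the values start out on the driver, are turned into a DataFrame, and 
then have to be pulled back.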

The original PR: https://github.com/apache/spark/pull/7263
[~MechCoder] - do you perhaps recall the reason?



