We returned a DataFrame since it is a nicer API, but I agree forcing RDD
operations is not ideal.  I'd be OK with adding a new method, but I agree
with Felix that we cannot break the API for something like this.

On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Given how Word2Vec is used in the pipeline model in the new ml
> implementation, we might need to keep the current behavior?
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>
>
> _____________________________
> From: Asher Krim <ak...@hubspot.com>
> Sent: Tuesday, January 3, 2017 11:58 PM
> Subject: Re: ml word2vec findSynonyms return type
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
> jos...@databricks.com>, <dev@spark.apache.org>
>
>
>
> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>
> Adding new methods could result in method clutter. Changing behavior of
> non-experimental classes is unfortunate (ml Word2Vec was marked
> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
> would rather change the existing methods to keep the class simpler moving
> forward.
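>
> For concreteness, here is a rough sketch of the "add a new method" option:
> keep the existing DataFrame-returning method for pipeline compatibility and
> add an array-returning variant next to it. The name `findSynonymsArray` is
> purely illustrative, not a settled API:
>
> ```scala
> // Sketch only -- method name is hypothetical, not a committed API.
> // The DataFrame method stays as-is for existing callers and pipelines;
> // the new variant hands back the driver-local results directly.
> def findSynonyms(word: String, num: Int): DataFrame = {
>   val spark = SparkSession.builder().getOrCreate()
>   spark.createDataFrame(findSynonymsArray(word, num).toSeq)
>     .toDF("word", "similarity")
> }
>
> def findSynonymsArray(word: String, num: Int): Array[(String, Double)] = {
>   wordVectors.findSynonyms(word, num)
> }
> ```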
>
>
> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Could you link to the JIRA here?
>>
>> What you suggest makes sense to me. Though we might want to maintain
>> compatibility and add a new method instead of changing the return type of
>> the existing one.
>>
>>
>> _____________________________
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Wednesday, December 28, 2016 11:52 AM
>> Subject: ml word2vec findSynonyms return type
>> To: <dev@spark.apache.org>
>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <
>> jos...@databricks.com>
>>
>>
>>
>> Hey all,
>>
>> I would like to propose changing the return type of `findSynonyms` in
>> ml's Word2Vec
>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
>> :
>>
>> def findSynonyms(word: String, num: Int): DataFrame = {
>>   val spark = SparkSession.builder().getOrCreate()
>>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
>> }
>>
>> I find it very strange that the results are parallelized before being
>> returned to the user. The results are already on the driver to begin with,
>> and I imagine that for most use cases (and definitely for ours) the
>> synonyms are collected right back to the driver. This incurs both the added
>> cost of shipping data to and from the cluster and a more cumbersome
>> interface than needed.
>>
>> Can we change it to just the following?
>>
>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>   wordVectors.findSynonyms(word, num)
>> }
>>
>> If the user wants the results parallelized, they can still do so on their
>> own.
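>>
>> For example, with the proposed return type, rebuilding today's DataFrame
>> on the caller's side would be a one-liner (a sketch, assuming a Spark 2.x
>> SparkSession named `spark` and a fitted model named `model`):
>>
>> ```scala
>> // Caller-side conversion: only needed by users who actually want a DataFrame.
>> val synonyms: Array[(String, Double)] = model.findSynonyms("king", 10)
>> val df = spark.createDataFrame(synonyms.toSeq).toDF("word", "similarity")
>> ```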
>>
>> (I had brought this up a while back in Jira. It was suggested that the
>> mailing list would be a better forum to discuss it, so here we are.)
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.
