We returned a DataFrame since it is a nicer API, but I agree forcing RDD operations is not ideal. I'd be OK with adding a new method, but I agree with Felix that we cannot break the API for something like this.
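For readers following along, here is a rough, self-contained sketch of what the backward-compatible option (keep the DataFrame method, add a driver-local one) could look like. The class, the method name `findSynonymsArray`, and all internals are placeholders for illustration, not the actual Spark implementation:

```scala
// Illustrative only: a toy model standing in for ml's Word2VecModel.
// The name `findSynonymsArray` and the internals are assumptions here,
// not Spark's actual code.
class Word2VecModelSketch(vectors: Map[String, Array[Double]]) {

  private def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }

  // The results are computed locally on the driver; nothing here is
  // distributed, which is the point raised in the thread below.
  def findSynonymsArray(word: String, num: Int): Array[(String, Double)] = {
    val query = vectors(word)
    vectors.iterator
      .collect { case (w, v) if w != word => (w, cosine(query, v)) }
      .toArray
      .sortBy(-_._2)
      .take(num)
  }

  // The existing DataFrame-returning method would stay as-is for
  // compatibility, wrapping the same local array, e.g.:
  // def findSynonyms(word: String, num: Int): DataFrame =
  //   spark.createDataFrame(findSynonymsArray(word, num))
  //     .toDF("word", "similarity")
}
```

With this shape, driver-side callers avoid the cluster round trip entirely, while pipeline users who want a DataFrame can still build one from the array themselves.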
On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> Given how Word2Vec is used in the pipeline model in the new ml
> implementation, we might need to keep the current behavior?
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>
> _____________________________
> From: Asher Krim <ak...@hubspot.com>
> Sent: Tuesday, January 3, 2017 11:58 PM
> Subject: Re: ml word2vec findSynonyms return type
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org>
>
> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>
> Adding new methods could result in method clutter. Changing the behavior
> of non-experimental classes is unfortunate (ml Word2Vec was marked
> Experimental until Spark 2.0). Neither option is great. If I had to pick,
> I would rather change the existing methods to keep the class simpler
> moving forward.
>
> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>
>> Could you link to the JIRA here?
>>
>> What you suggest makes sense to me. Though we might want to maintain
>> compatibility and add a new method instead of changing the return type
>> of the existing one.
>>
>> _____________________________
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Wednesday, December 28, 2016 11:52 AM
>> Subject: ml word2vec findSynonyms return type
>> To: <dev@spark.apache.org>
>> Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley <jos...@databricks.com>
>>
>> Hey all,
>>
>> I would like to propose changing the return type of `findSynonyms` in
>> ml's Word2Vec
>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>:
>>
>>   def findSynonyms(word: String, num: Int): DataFrame = {
>>     val spark = SparkSession.builder().getOrCreate()
>>     spark.createDataFrame(wordVectors.findSynonyms(word, num))
>>       .toDF("word", "similarity")
>>   }
>>
>> I find it very strange that the results are parallelized before being
>> returned to the user. The results are already on the driver to begin
>> with, and I can imagine that for most use cases (and definitely for
>> ours) the synonyms are collected right back to the driver. This incurs
>> both the added cost of shipping data to and from the cluster and a more
>> cumbersome interface than needed.
>>
>> Can we change it to just the following?
>>
>>   def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>     wordVectors.findSynonyms(word, num)
>>   }
>>
>> If the user wants the results parallelized, they can still do so on
>> their own.
>>
>> (I had brought this up a while back in Jira. It was suggested that the
>> mailing list would be a better forum to discuss it, so here we are.)
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
http://databricks.com