Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17451
  
    This is what I highlighted a while back - I think it's an issue with Py4J
    not converting Scala tuples.
    
    That's why we do need a private method to convert the result to a list or
    array - something that can be converted on the Python side.
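
    As a rough sketch of the kind of private helper I mean (the method name
    findSynonymsArrayAsLists and the list-of-lists shape are hypothetical, not
    the actual patch), the Scala side could flatten the Array[(String, Double)]
    into a java.util.List of [word, score] pairs, which Py4J can pass to Python
    as plain lists:

        import java.util.{List => JList}
        import scala.collection.JavaConverters._

        // Hypothetical private helper on Word2VecModel: avoids sending
        // scala.Tuple2 across Py4J by flattening each (word, similarity)
        // pair into a java.util.List.
        private[spark] def findSynonymsArrayAsLists(
            word: String,
            num: Int): JList[JList[Any]] = {
          findSynonymsArray(word, num).map { case (synonym, similarity) =>
            // Each inner list is [String, java.lang.Double]; Py4J converts both.
            List[Any](synonym, similarity).asJava
          }.toList.asJava
        }

    With something like that, the Python wrapper would receive an ordinary list
    of [word, score] lists instead of opaque scala.Tuple2 objects.
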
    On Mon, 3 Jul 2017 at 20:20, Xin Ren <[email protected]> wrote:
    
    > *@keypointt* commented on this pull request.
    > ------------------------------
    >
    > In mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala
    > <https://github.com/apache/spark/pull/17451#discussion_r125339579>:
    >
    > > @@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
    >      wordVectors.findSynonyms(word, num)
    >    }
    >
    > +  /**
    >
    > Hi Holden, I tried to call the original findSynonymsArray() in Scala from
    > the Python side:
    >
    > >>> from pyspark.ml.feature import Word2Vec
    > >>> sent = ("a b " * 100 + "a c " * 10).split(" ")
    > >>> df = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
    > >>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
    > >>> model = word2Vec.fit(df)
    > >>> a = model.findSynonymsArray("a", 2)
    >
    > and on the Python side I got back a list of dicts, as shown below. Calling
    > _1() and _2() cannot get the actual data; I just get the string
    > u'scala.Tuple2'.
    >
    > Maybe I'm missing something here? Could you please help with how to get
    > the data here? Thanks a lot.
    >
    > >>> a
    > [{u'__class__': u'scala.Tuple2'}, {u'__class__': u'scala.Tuple2'}]
    > >>> len(a)
    > 2
    > >>> a[0]
    > {u'__class__': u'scala.Tuple2'}
    > >>> for e in a[0]:
    > ...     print ''.join(a[0][e])
    > ...
    > scala.Tuple2
    > >>> for e in a[0]:
    > ...     print a[0][e]._1()
    > ...
    > Traceback (most recent call last):
    >   File "<stdin>", line 2, in <module>
    > AttributeError: 'unicode' object has no attribute '_1'
    > >>> for e in a[0]:
    > ...     print a[0][e]._2()
    > ...
    > Traceback (most recent call last):
    >   File "<stdin>", line 2, in <module>
    > AttributeError: 'unicode' object has no attribute '_2'
    >


