Bobby Chowdary created ZEPPELIN-97:
--------------------------------------

             Summary: pyspark issue with mllib api
                 Key: ZEPPELIN-97
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-97
             Project: Zeppelin
          Issue Type: Bug
          Components: Interpreters
    Affects Versions: 0.5.0
         Environment: Spark 1.4 on MapR Hadoop, running on CentOS 7.0
            Reporter: Bobby Chowdary


The pyspark interpreter seems to have an issue accessing a Python RDD when using the MLlib API:

{code}
import numpy as np
from sklearn.cross_validation import train_test_split
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# 100 random samples with 3 features each, integer labels in 0..4
X = np.random.rand(100,3)
y = np.random.randint(5,size=100)

# hold out 20% of the samples for testing
trainX,testX,trainy,testy = train_test_split(X,y,test_size=0.2)

# sc is the SparkContext provided by the interpreter
training = sc.parallelize([LabeledPoint(ylabel,Vectors.dense(xrow)) for (xrow,ylabel) in zip(trainX,trainy)])
testing = sc.parallelize([LabeledPoint(ylabel,Vectors.dense(xrow)) for (xrow,ylabel) in zip(testX,testy)])

# train with smoothing parameter 0.1
model = NaiveBayes.train(training, 0.1)
{code}

h4. Error:
{noformat}
(<type 'exceptions.AttributeError'>, AttributeError("'list' object has no attribute '_get_object_id'",), <traceback object at 0x392b638>)
{noformat}

The code above runs fine from the pyspark shell. I also tested other features, 
such as DataFrames, from the Zeppelin pyspark interpreter, and they seem to work fine as well.
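
The error above does not show which statement raised it. A minimal sketch for narrowing that down is given here, assuming the paragraph can be rerun in Zeppelin; {{run_steps}} is a hypothetical helper written for this report, not part of Zeppelin or PySpark. It runs named steps in order and stops at the first one that fails.

```python
def run_steps(steps):
    """Run (name, zero-argument callable) pairs in order.

    Returns a list of (name, status) tuples and stops at the first
    step that raises, so the failing statement is easy to spot.
    """
    results = []
    for name, fn in steps:
        try:
            fn()
            results.append((name, "ok"))
        except Exception as exc:
            results.append((name, "%s: %s" % (type(exc).__name__, exc)))
            break
    return results


# Hypothetical usage in the %pyspark paragraph, after the imports and the
# train_test_split call from the report (sc, trainX, trainy come from there):
#
# state = {}
#
# def build_rdd():
#     state["training"] = sc.parallelize(
#         [LabeledPoint(l, Vectors.dense(x)) for (x, l) in zip(trainX, trainy)])
#
# def train():
#     state["model"] = NaiveBayes.train(state["training"], 0.1)
#
# print(run_steps([("parallelize", build_rdd), ("train", train)]))
```

If the first step reports ok and the second reports the AttributeError, the failure is in the call into MLlib rather than in creating the RDD.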




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
