Bobby Chowdary created ZEPPELIN-97:
--------------------------------------
Summary: pyspark issue with mllib api
Key: ZEPPELIN-97
URL: https://issues.apache.org/jira/browse/ZEPPELIN-97
Project: Zeppelin
Issue Type: Bug
Components: Interpreters
Affects Versions: 0.5.0
Environment: spark 1.4 on mapr hadoop, running on centos 7.0
Reporter: Bobby Chowdary
The pyspark interpreter seems to have an issue accessing a Python RDD:
{code}
import numpy as np
from sklearn.cross_validation import train_test_split
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
X = np.random.rand(100,3)
y = np.random.randint(5,size=100)
trainX,testX,trainy,testy = train_test_split(X,y,test_size=0.2)
training = sc.parallelize([LabeledPoint(ylabel, Vectors.dense(xrow))
                           for (xrow, ylabel) in zip(trainX, trainy)])
testing = sc.parallelize([LabeledPoint(ylabel, Vectors.dense(xrow))
                          for (xrow, ylabel) in zip(testX, testy)])
model = NaiveBayes.train(training, 0.1)
{code}
h4. Error:
{noformat}
(<type 'exceptions.AttributeError'>, AttributeError("'list' object has no attribute '_get_object_id'",), <traceback object at 0x392b638>)
{noformat}
The above code runs fine from the pyspark shell. I also tested other features, such as data
frames, from the Zeppelin pyspark interpreter, and they seem to work fine as well.
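
For reference, the error text has the shape of an attribute lookup failing on a plain Python list. The sketch below is not Zeppelin or PySpark code: FakeJvmHandle, send_to_gateway and describe_failure are hypothetical stand-ins, used only to show how a _get_object_id call fails this way when a list arrives where a gateway-backed object handle is expected. Whether this is what actually happens inside the interpreter is an assumption and is not confirmed by anything in this report.

```python
class FakeJvmHandle(object):
    """Hypothetical stand-in for an object that lives on the JVM side."""

    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        return self._object_id


def send_to_gateway(arg):
    """Hypothetical bridge call: it needs the argument's JVM object id."""
    return arg._get_object_id()


def describe_failure(arg):
    """Return the AttributeError message raised for arg, or None on success."""
    try:
        send_to_gateway(arg)
        return None
    except AttributeError as exc:
        return str(exc)


# A handle-like object passes through; a plain list does not.
handle_result = describe_failure(FakeJvmHandle("o42"))
list_result = describe_failure([1.0, 2.0, 3.0])
print(handle_result)
print(list_result)
```

If that assumption holds, the places to look would be wherever the interpreter hands a Python-side collection to the JVM without first wrapping or converting it.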
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)