Re: Python, Spark and HBase

2015-08-03 Thread ericbless
I wanted to confirm whether reading from HBase in PySpark is now supported,
e.g. as of Spark v1.3.0.

I've read varying information online, so I just thought I'd verify.
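
For reference, the kind of usage I'm hoping is supported looks roughly like
the sketch below, adapted from the hbase_inputformat.py example that ships
with the Spark source. The host and table names are placeholders, and the
two converter classes live in the Spark examples jar, which would need to
be on the classpath:

from pyspark import SparkContext

sc = SparkContext("local", "HBase read test")

# Placeholder values, not a real cluster.
conf = {"hbase.zookeeper.quorum": "my-host",
        "hbase.mapreduce.inputtable": "my-table"}

# The converter classes ship with the Spark examples; they turn the HBase
# key/value types into plain strings on the Python side.
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf)

print(hbase_rdd.take(1))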

Thanks






Re: Python, Spark and HBase

2014-05-29 Thread Nick Pentreath
Hi Tommer,

I'm working on updating and improving the PR, and will work on getting an
HBase example working with it. Will feed back as soon as I have had the
chance to work on this a bit more.

N





Re: Python, Spark and HBase

2014-05-28 Thread twizansk
Hi Nick,

I finally got around to downloading and building the patch.  

I pulled the code from
https://github.com/MLnick/spark-1/tree/pyspark-inputformats

I am running on a CDH5 node.  While the code in the CDH branch is different
from Spark master, I believe I have resolved any inconsistencies.

When attempting to connect to an HBase table using
SparkContext.newAPIHadoopFile, I receive the following error:

Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile
does not exist in the JVM

I have searched the pyspark-inputformats branch and cannot find any
reference to the class org.apache.spark.api.python.PythonRDDnewAPIHadoopFile

Any ideas?

Also, do you have a working example of HBase access with the new code?

Thanks

Tommer  





Re: Python, Spark and HBase

2014-05-28 Thread Matei Zaharia
It sounds like you made a typo in the code — perhaps you’re trying to call
self._jvm.PythonRDDnewAPIHadoopFile instead of
self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before
"newAPIHadoopFile".
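
To make the difference concrete, here is a minimal sketch of the two
lookups through the py4j gateway (assuming an active SparkContext sc;
PySpark imports org.apache.spark.api.python.* into the gateway, so
PythonRDD resolves as a class):

# Hypothetical typo: asks the gateway for a single attribute literally
# named "PythonRDDnewAPIHadoopFile" rather than a class plus a method.
sc._jvm.PythonRDDnewAPIHadoopFile

# Intended lookup: resolve the PythonRDD class first, then fetch its
# newAPIHadoopFile method from that class.
sc._jvm.PythonRDD.newAPIHadoopFile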


Matei




Re: Python, Spark and HBase

2014-05-28 Thread twizansk
In my code I am not referencing PythonRDD or PythonRDDnewAPIHadoopFile at
all.  I am calling SparkContext.newAPIHadoopFile with: 

inputformat_class='org.apache.hadoop.hbase.mapreduce.TableInputFormat',
key_class='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
value_class='org.apache.hadoop.hbase.client.Result'

Is it possible that the typo is coming from inside the spark code?

Tommer





Re: Python, Spark and HBase

2014-05-28 Thread twizansk
The code which causes the error is:

sc = SparkContext("local", "My App")
rdd = sc.newAPIHadoopFile(
    name,
    'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    conf={"hbase.zookeeper.quorum": "my-host",
          "hbase.rootdir": "hdfs://my-host:8020/hbase",
          "hbase.mapreduce.inputtable": "data"})

The full stack trace is:



Py4JError                                 Traceback (most recent call last)
<ipython-input-8-3b9a4ea2f659> in <module>()
      7         conf={"hbase.zookeeper.quorum": "my-host",
      8               "hbase.rootdir": "hdfs://my-host:8020/hbase",
      9               "hbase.mapreduce.inputtable": "data"})
     10
     11

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.pyc in
newAPIHadoopFile(self, name, inputformat_class, key_class, value_class,
key_wrapper, value_wrapper, conf)
    281         for k, v in conf.iteritems():
    282             jconf[k] = v
--> 283         jrdd = self._jvm.PythonRDD.newAPIHadoopFile(self._jsc, name,
inputformat_class, key_class, value_class,
    284                                                     key_wrapper,
value_wrapper, jconf)
    285         return RDD(jrdd, self, PickleSerializer())

/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py
in __getattr__(self, name)
    657         else:
    658             raise Py4JError('{0} does not exist in the JVM'.
--> 659                             format(self._fqn + name))
    660
    661     def __call__(self, *args):

Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile does not
exist in the JVM
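
A quick way to double-check what the launched JVM actually exposes is plain
Java reflection over the same py4j gateway (just a sketch; it assumes the
active SparkContext sc from the snippet above):

# List the method names that the running JVM's PythonRDD really has.
klass = sc._jvm.java.lang.Class.forName(
    "org.apache.spark.api.python.PythonRDD")
method_names = sorted(set(m.getName() for m in klass.getMethods()))
print("newAPIHadoopFile" in method_names)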





Re: Python, Spark and HBase

2014-05-21 Thread twizansk
Thanks Nick and Matei.   I'll take a look at the patch and keep you updated.

Tommer





Re: Python, Spark and HBase

2014-05-20 Thread Matei Zaharia
Unfortunately this is not yet possible. There’s a patch in progress posted here 
though: https://github.com/apache/spark/pull/455 — it would be great to get 
your feedback on it.

Matei

On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote:

 Hello,
 
 This seems like a basic question but I have been unable to find an answer in
 the archives or other online sources.
 
 I would like to know if there is any way to load an RDD from HBase in Python.
 In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a
 TableInputFormat class.  Is there any equivalent in Python?
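 
 In other words, I'm hoping for something along these lines (purely
 hypothetical on my part; I don't know whether any such API exists in
 PySpark yet):
 
 rdd = sc.newAPIHadoopRDD(
     'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
     'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
     'org.apache.hadoop.hbase.client.Result',
     conf={'hbase.mapreduce.inputtable': 'mytable'})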
 
 Thanks
 
 
 