Re: Python, Spark and HBase
I wanted to confirm whether this is now supported, e.g. in Spark v1.3.0. I've read varying information online, so I thought I'd verify. Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p24117.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
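For context, a sketch of what reading an HBase table from PySpark looks like once the input-format support discussed later in this thread landed. This assumes Spark >= 1.1 (where sc.newAPIHadoopRDD became available in Python) and that the spark-examples jar, which provides the pythonconverters classes, is on the classpath; the hostname and table name are placeholders:

```python
def hbase_conf(quorum, table):
    """Hadoop configuration dict handed to newAPIHadoopRDD.
    quorum and table are placeholder values for this sketch."""
    return {
        "hbase.zookeeper.quorum": quorum,
        "hbase.mapreduce.inputtable": table,
    }

def read_hbase_table(sc, quorum, table):
    # sc is a pyspark.SparkContext. The converter classes (from the Spark
    # examples jar) turn the HBase key/value types into plain strings on
    # the Python side.
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=hbase_conf(quorum, table))

if __name__ == "__main__":
    # Requires a running HBase/ZooKeeper, so this part is not executable
    # standalone; it only shows the intended call pattern.
    from pyspark import SparkContext
    sc = SparkContext("local", "HBase read sketch")
    print(read_hbase_table(sc, "my-host", "data").take(5))
```

This is a sketch under the stated assumptions, not a definitive recipe; in particular the converter class names come from Spark's bundled examples and may differ across versions.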
Re: Python, Spark and HBase
Hi Tommer,

I'm working on updating and improving the PR, and will work on getting an HBase example working with it. Will feed back as soon as I have had the chance to work on this a bit more.

N

On Thu, May 29, 2014 at 3:27 AM, twizansk twiza...@gmail.com wrote:

> The code which causes the error is:
>
>     sc = SparkContext("local", "My App")
>     rdd = sc.newAPIHadoopFile(
>         name,
>         'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
>         'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
>         'org.apache.hadoop.hbase.client.Result',
>         conf={'hbase.zookeeper.quorum': 'my-host',
>               'hbase.rootdir': 'hdfs://my-host:8020/hbase',
>               'hbase.mapreduce.inputtable': 'data'})
>
> The full stack trace is:
>
>     Py4JError                                 Traceback (most recent call last)
>     <ipython-input-8-3b9a4ea2f659> in <module>()
>           7         conf={'hbase.zookeeper.quorum': 'my-host',
>           8               'hbase.rootdir': 'hdfs://my-host:8020/hbase',
>           9               'hbase.mapreduce.inputtable': 'data'})
>          10
>          11
>
>     /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.pyc in newAPIHadoopFile(self, name, inputformat_class, key_class, value_class, key_wrapper, value_wrapper, conf)
>         281         for k, v in conf.iteritems():
>         282             jconf[k] = v
>     --> 283         jrdd = self._jvm.PythonRDD.newAPIHadoopFile(self._jsc, name, inputformat_class, key_class, value_class,
>         284                                                     key_wrapper, value_wrapper, jconf)
>         285         return RDD(jrdd, self, PickleSerializer())
>
>     /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __getattr__(self, name)
>         657         else:
>         658             raise Py4JError('{0} does not exist in the JVM'.
>     --> 659                             format(self._fqn + name))
>         660
>         661     def __call__(self, *args):
>
>     Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile does not exist in the JVM
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6507.html
Re: Python, Spark and HBase
Hi Nick,

I finally got around to downloading and building the patch. I pulled the code from https://github.com/MLnick/spark-1/tree/pyspark-inputformats

I am running on a CDH5 node. While the code in the CDH branch is different from Spark master, I do believe that I have resolved any inconsistencies. When attempting to connect to an HBase table using SparkContext.newAPIHadoopFile I receive the following error:

    Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile does not exist in the JVM

I have searched the pyspark-inputformats branch and cannot find any reference to the class org.apache.spark.api.python.PythonRDDnewAPIHadoopFile. Any ideas?

Also, do you have a working example of HBase access with the new code?

Thanks
Tommer

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6502.html
Re: Python, Spark and HBase
It sounds like you made a typo in the code — perhaps you’re trying to call self._jvm.PythonRDDnewAPIHadoopFile instead of self._jvm.PythonRDD.newAPIHadoopFile? There should be a dot before the "new".

Matei

On May 28, 2014, at 5:25 PM, twizansk twiza...@gmail.com wrote:

> Hi Nick,
>
> I finally got around to downloading and building the patch. I pulled the code from https://github.com/MLnick/spark-1/tree/pyspark-inputformats
>
> I am running on a CDH5 node. While the code in the CDH branch is different from Spark master, I do believe that I have resolved any inconsistencies. When attempting to connect to an HBase table using SparkContext.newAPIHadoopFile I receive the following error:
>
>     Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile does not exist in the JVM
>
> I have searched the pyspark-inputformats branch and cannot find any reference to the class org.apache.spark.api.python.PythonRDDnewAPIHadoopFile. Any ideas?
>
> Also, do you have a working example of HBase access with the new code?
>
> Thanks
> Tommer
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6502.html
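A toy illustration of why the missing dot surfaces as one fused class name in the error above. This is not py4j itself, just a simplified mimic (my own names throughout) of how its gateway chains attribute lookups by concatenating each attribute onto a fully qualified name, with AttributeError standing in for Py4JError:

```python
class FakeJVMPackage:
    """Simplified stand-in for py4j's JVM package objects."""
    def __init__(self, fqn=""):
        self._fqn = fqn

    def __getattr__(self, name):
        # py4j builds the fully qualified name one attribute at a time,
        # so a known package component extends the chain...
        known = {"org", "apache", "spark", "api", "python", "PythonRDD"}
        if name in known:
            return FakeJVMPackage(self._fqn + name + ".")
        # ...and an unknown member raises with self._fqn + name, which is
        # why a missing dot shows up as one fused identifier like
        # "PythonRDDnewAPIHadoopFile".
        raise AttributeError(
            "{0} does not exist in the JVM".format(self._fqn + name))

jvm = FakeJVMPackage()
try:
    jvm.org.apache.spark.api.python.PythonRDDnewAPIHadoopFile
except AttributeError as e:
    print(e)
    # → org.apache.spark.api.python.PythonRDDnewAPIHadoopFile
    #   does not exist in the JVM
```

The same lookup with the dot in place (jvm.org...PythonRDD.newAPIHadoopFile) would resolve PythonRDD first and only then look up the method, which is why the real PythonRDD.newAPIHadoopFile call works.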
Re: Python, Spark and HBase
In my code I am not referencing PythonRDD or PythonRDDnewAPIHadoopFile at all. I am calling SparkContext.newAPIHadoopFile with:

    inputformat_class='org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    key_class='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    value_class='org.apache.hadoop.hbase.client.Result'

Is it possible that the typo is coming from inside the Spark code?

Tommer

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6506.html
Re: Python, Spark and HBase
The code which causes the error is:

    sc = SparkContext("local", "My App")
    rdd = sc.newAPIHadoopFile(
        name,
        'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
        'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
        'org.apache.hadoop.hbase.client.Result',
        conf={'hbase.zookeeper.quorum': 'my-host',
              'hbase.rootdir': 'hdfs://my-host:8020/hbase',
              'hbase.mapreduce.inputtable': 'data'})

The full stack trace is:

    Py4JError                                 Traceback (most recent call last)
    <ipython-input-8-3b9a4ea2f659> in <module>()
          7         conf={'hbase.zookeeper.quorum': 'my-host',
          8               'hbase.rootdir': 'hdfs://my-host:8020/hbase',
          9               'hbase.mapreduce.inputtable': 'data'})
         10
         11

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.pyc in newAPIHadoopFile(self, name, inputformat_class, key_class, value_class, key_wrapper, value_wrapper, conf)
        281         for k, v in conf.iteritems():
        282             jconf[k] = v
    --> 283         jrdd = self._jvm.PythonRDD.newAPIHadoopFile(self._jsc, name, inputformat_class, key_class, value_class,
        284                                                     key_wrapper, value_wrapper, jconf)
        285         return RDD(jrdd, self, PickleSerializer())

    /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __getattr__(self, name)
        657         else:
        658             raise Py4JError('{0} does not exist in the JVM'.
    --> 659                             format(self._fqn + name))
        660
        661     def __call__(self, *args):

    Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile does not exist in the JVM

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6507.html
Re: Python, Spark and HBase
Thanks Nick and Matei. I'll take a look at the patch and keep you updated.

Tommer

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6176.html
Re: Python, Spark and HBase
Unfortunately this is not yet possible. There’s a patch in progress posted here, though: https://github.com/apache/spark/pull/455 — it would be great to get your feedback on it.

Matei

On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote:

> Hello,
>
> This seems like a basic question but I have been unable to find an answer in the archives or other online sources. I would like to know if there is any way to load an RDD from HBase in Python. In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a TableInputFormat class. Is there any equivalent in Python?
>
> Thanks
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142.html