Hi Jai,

Your suspicion is correct. In general, Python RDDs are pickled into byte arrays and stored in Java land as RDDs of byte arrays. union/zip operate on those byte arrays directly, without deserializing them. Currently, the Python byte arrays only get unpickled into Java objects in special cases, like SQL functions or saving to SequenceFiles (upcoming).
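To make that concrete for your example: a Scala function receiving a Python RDD sees an RDD[Array[Byte]] of pickled data, so it has to unpickle before it can do typed work, and pickle the results back before returning them to pyspark. Below is a rough, untested sketch of one way to do that with Pyrolite (the pickle library Spark itself ships); PickledRDDHelper and mapPickled are made-up names for illustration, not Spark API:

    import org.apache.spark.rdd.RDD
    import net.razorvine.pickle.{Pickler, Unpickler}
    import scala.collection.JavaConverters._

    // Hypothetical helper, not part of Spark: unpickle each element of a
    // pyspark-backed RDD[Array[Byte]], apply f, and pickle the results back
    // so pyspark can still read them.
    object PickledRDDHelper {
      def mapPickled(data: RDD[Array[Byte]])(f: Int => Int): RDD[Array[Byte]] = {
        data.mapPartitions { iter =>
          // Pickler/Unpickler are not thread-safe, so create them per partition.
          val unpickler = new Unpickler
          val pickler = new Pickler
          iter.map { bytes =>
            unpickler.loads(bytes) match {
              // pyspark often batches elements, so one byte array may
              // unpickle to a java.util.List rather than a single value.
              case batch: java.util.List[_] =>
                val mapped = batch.asScala.map(x => Int.box(f(x.asInstanceOf[Number].intValue)))
                pickler.dumps(mapped.asJava)
              case single: Number =>
                pickler.dumps(Int.box(f(single.intValue)))
              case other =>
                throw new IllegalArgumentException("unexpected pickled type: " + other.getClass)
            }
          }
        }
      }
    }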
Hope it helps.

Kan

On Fri, Jul 11, 2014 at 5:04 AM, Jai Kumar Singh <fluke...@flukebox.in> wrote:
> Hi,
> I want to write some common utility functions in Scala and call them from
> the Java/Python Spark APIs (maybe with some wrapper code around the Scala
> calls). Calling Scala functions from Java works fine. I was reading the
> pyspark RDD code and found that pyspark is able to call JavaRDD functions
> like union/zip for a pyspark RDD, deserialize the output, and have
> everything work fine. But somehow I am not able to get a really simple
> example working. I think I am missing some serialization/deserialization
> step.
>
> Can someone confirm whether it is even possible to do so? Or would it be
> much easier to pass RDD data files around instead of RDDs directly (from
> pyspark to Java/Scala)?
>
> For example, the code below just adds 1 to each element of an RDD of
> integers.
>
> package flukebox.test
>
> import org.apache.spark.rdd.RDD
>
> object TestClass {
>   def testFunc(data: RDD[Int]) = {
>     data.map(x => x + 1)
>   }
> }
>
> Calling from Python:
>
> from pyspark import RDD
> from py4j.java_gateway import java_import
>
> java_import(sc._gateway.jvm, "flukebox.test")
>
> data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
> sc._jvm.flukebox.test.TestClass.testFunc(data._jrdd.rdd())
>
> *This fails because testFunc gets an RDD of byte arrays.*
>
> Any help/pointer would be highly appreciated.
>
> Thanks & Regards,
> Jai K Singh
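With a helper along those lines, the testFunc from the quoted mail could be adapted to accept what pyspark actually hands over; again just a sketch on top of the hypothetical PickledRDDHelper above, not tested:

    import org.apache.spark.rdd.RDD

    object TestClass {
      // pyspark passes RDD[Array[Byte]] of pickled data, not RDD[Int], so the
      // signature has to take and return byte arrays.
      def testFunc(data: RDD[Array[Byte]]): RDD[Array[Byte]] =
        PickledRDDHelper.mapPickled(data)(_ + 1)
    }

On the Python side, the result would still come back as pickled bytes, so it would need to be wrapped back into a pyspark RDD with a deserializer matching how the Scala side pickled it (pyspark's RDD constructor takes the Java RDD, the SparkContext, and a deserializer); the exact wrapping differs between pyspark versions.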