Hi All,

I'm currently working on a project that involves transferring data between
Spark 3.x (I use Scala) and a Python runtime. In Spark, the data is stored
in an RDD of floating-point arrays/vectors, and I have custom routines
written in Python to process them. On the Spark side I also rely on
operations specific to the Spark Scala API, so I need both runtimes.

To transfer the data I've been using the RDD.pipe() API, roughly like the
sketch below:

1. Spark converts the arrays to strings and calls RDD.pipe(script.py).
2. Python reads the strings from stdin, parses them into Python data
structures, and runs the operations.
3. Python converts the result arrays back into strings and prints them to
stdout for Spark.
4. Spark reads the strings and parses them back into arrays.
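
In code, the round trip currently looks roughly like this (just a sketch
to make the steps concrete; "vecs" stands in for my real RDD and
"script.py" for my Python worker):

    import org.apache.spark.rdd.RDD

    // toy stand-in for my real data (spark-shell, so "sc" is in scope)
    val vecs: RDD[Array[Double]] = sc.parallelize(Seq(Array(1.0, 2.0, 3.0)))

    // 1. arrays -> comma-separated text lines
    val asText = vecs.map(_.mkString(","))
    // 2./3. each line is fed to script.py's stdin; the script prints result lines
    val piped = asText.pipe("script.py")
    // 4. parse the printed lines back into arrays
    val back: RDD[Array[Double]] = piped.map(_.split(",").map(_.toDouble))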

Needless to say, this feels unnatural and slow to me, and there are
potential floating-point precision issues: the arrays really ought to be
transmitted as raw bytes. I found no way to make RDD.pipe() do that;
judging from
https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139
, .pipe() seems to be locked into text-based, line-by-line streaming.
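
The best workaround I can see within the text constraint is to
base64-encode the raw IEEE 754 bytes, so the text round trip is at least
lossless. A sketch (it assumes script.py is adapted to decode/re-encode
base64 lines, e.g. with Python's base64 plus struct or numpy.frombuffer):

    import java.nio.{ByteBuffer, ByteOrder}
    import java.util.Base64

    // pack the doubles into little-endian bytes, then base64 them into one text line
    def toLine(arr: Array[Double]): String = {
      val buf = ByteBuffer.allocate(arr.length * 8).order(ByteOrder.LITTLE_ENDIAN)
      arr.foreach(buf.putDouble)
      Base64.getEncoder.encodeToString(buf.array())
    }

    // reverse: base64 line -> doubles, bit-for-bit identical to what was sent
    def fromLine(line: String): Array[Double] = {
      val bytes = Base64.getDecoder.decode(line)
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
      Array.fill(bytes.length / 8)(buf.getDouble)
    }

    val roundTripped = vecs.map(toLine).pipe("script.py").map(fromLine)

That would fix the precision problem, but it still pays the
string/encoding overhead on every record.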

Can anyone shed some light on how I can achieve this? I'm trying to find a
way that does not involve modifying core Spark myself. One potential
solution I can think of is saving/loading the RDD as binary files (see the
sketch below), but I'm hoping to find a streaming-based solution. Any help
is much appreciated, thanks!
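
For the save/load route, one variant that Python could actually read back
(plain saveAsObjectFile uses Java serialization, so it wouldn't help here)
might be a Hadoop SequenceFile of raw byte records, something like this
sketch (the path is just a placeholder):

    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import java.nio.{ByteBuffer, ByteOrder}

    // pack each vector into little-endian bytes
    def toBytes(arr: Array[Double]): Array[Byte] = {
      val buf = ByteBuffer.allocate(arr.length * 8).order(ByteOrder.LITTLE_ENDIAN)
      arr.foreach(buf.putDouble)
      buf.array()
    }

    // one raw-byte record per vector
    vecs.map(a => (NullWritable.get(), new BytesWritable(toBytes(a))))
        .saveAsSequenceFile("hdfs:///tmp/vecs-seq")   // placeholder path

On the PySpark side, sc.sequenceFile("hdfs:///tmp/vecs-seq") should hand
the values back as bytearrays that numpy.frombuffer(..., dtype="<f8") can
decode exactly. But that's files on disk, not streaming, which is why I'm
still hoping for something better.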


Best regards,
Yuhao
