Yuhao,

You can use PySpark as the entrypoint to your application. With Py4J you can 
call Java/Scala functions from the Python application, so there's no need to 
use the pipe() function for that.
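
A minimal sketch of what that can look like, assuming your Scala routines are 
compiled into a jar submitted with --jars (com.example.MyScalaRoutines and its 
process method are hypothetical names; sc._jvm and rdd._jrdd are internal but 
commonly used handles):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("py4j-bridge").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])

    # sc._jvm exposes the driver JVM through the Py4J gateway;
    # rdd._jrdd is the underlying JavaRDD that the Scala side can consume.
    java_rdd = sc._jvm.com.example.MyScalaRoutines.process(rdd._jrdd)

This way the data stays inside the JVM instead of being round-tripped through 
text.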


Shay

________________________________
From: Yuhao Zhang <yhzhang1...@gmail.com>
Sent: Saturday, July 9, 2022 4:13:42 AM
To: user@spark.apache.org
Subject: [EXTERNAL] RDD.pipe() for binary data

Hi All,

I'm currently working on a project that involves transferring data between Spark 
3.x (I use Scala) and a Python runtime. In Spark, the data is stored in an RDD as 
floating-point arrays/vectors, and I have custom routines written in Python to 
process them. On the Spark side I also have some operations specific to the 
Spark Scala APIs, so I need to use both runtimes.

To achieve the data transfer I've been using the RDD.pipe() API:

1. Convert the arrays to strings in Spark and call RDD.pipe(script.py).
2. Python receives the strings, casts them to Python data structures, and runs 
the custom routines (sketched below).
3. Python converts the resulting arrays back into strings and prints them back 
to Spark.
4. Spark reads the strings and casts them back to arrays.
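
For concreteness, the Python side of this round trip (steps 2-3) looks roughly 
like the following, assuming one comma-separated float array per input line 
(the doubling routine is just a placeholder for my actual code):

    #!/usr/bin/env python3
    import sys

    def process(vec):
        # placeholder for the custom Python routine
        return [x * 2.0 for x in vec]

    # each stdin line is one record piped from Spark as text
    for line in sys.stdin:
        vec = [float(tok) for tok in line.strip().split(",")]
        print(",".join(repr(x) for x in process(vec)))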

Needless to say, this feels unnatural and slow to me, and there are potential 
floating-point precision issues, since the float arrays really should be 
transmitted as raw bytes. I found no way to use RDD.pipe() for that purpose; 
judging from 
https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139,
pipe() seems to be locked to text-based (line-oriented) streaming.
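
One thing I've considered to at least avoid the precision loss is to keep the 
text channel but base64-encode the raw IEEE-754 bytes instead of printing 
decimal strings, e.g. on the script side (again, the doubling routine is a 
placeholder):

    import base64
    import struct
    import sys

    for line in sys.stdin:
        raw = base64.b64decode(line.strip())
        # "<%dd": little-endian float64, count derived from the byte length
        vec = struct.unpack("<%dd" % (len(raw) // 8), raw)
        out = [x * 2.0 for x in vec]  # custom routine placeholder
        packed = struct.pack("<%dd" % len(out), *out)
        print(base64.b64encode(packed).decode("ascii"))

That keeps the bits exact, but it still pays the encode/decode cost, so it 
doesn't address the speed concern.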

Can anyone shed some light on how I can achieve this? I'm trying to find a way 
that does not involve modifying Spark's core myself. One potential solution I 
can think of is saving/loading the RDD as binary files, but I'm hoping for a 
streaming-based solution. Any help is much appreciated, thanks!


Best regards,
Yuhao
