I'm curious about using shared memory to speed up the JVM->Python round trip. Is there any sane way to do anonymous shared memory in Java/Scala?
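For what it's worth, a minimal sketch of what the Python half of such a handoff could look like, using a memory-mapped file rather than truly anonymous shared memory (on Linux, a file under /dev/shm is the closest portable stand-in; on the JVM side this would be a FileChannel.map / MappedByteBuffer). The path, values, and the assumption that the JVM writes little-endian float64s are all illustrative:

```python
import mmap
import os
import struct
import tempfile

# Simulate the JVM writer: pack float64 values into a file. In Scala this
# would be a MappedByteBuffer; note that JVM ByteBuffers default to
# big-endian, so the writer must call order(ByteOrder.LITTLE_ENDIAN) to
# match the "<" format used here.
values = [1.5, -2.25, 3.125]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"<{len(values)}d", *values))
    path = f.name

# Python side: memory-map the same file and view the bytes as doubles,
# with no text encoding/decoding and no copy through stdin/stdout.
with open(path, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    roundtripped = list(struct.unpack(f"<{len(values)}d", buf[:]))
    buf.close()

os.unlink(path)
print(roundtripped)  # exact float64 round trip, no precision loss
```

This only sketches the data path; coordinating who writes when (and cleanup) would still need a handshake between the two processes.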
On Sat, Jul 16, 2022 at 16:10 Sebastian Piu <sebastian....@gmail.com> wrote:

> Other alternatives are to look at how PythonRDD does it in Spark. You
> could also go for a more traditional setup where you expose your Python
> functions behind a local/remote service and call that from Scala, say
> over Thrift/gRPC/HTTP/local socket, etc.
> Another option, though I've never done it so I'm not sure if it would
> work, is to look at whether Arrow could help by sharing a piece of
> memory holding the data you need from Scala and then reading it from
> Python.
>
> On Sat, Jul 16, 2022 at 9:56 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Use GraphFrames?
>>
>> On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang <yhzhang1...@gmail.com> wrote:
>>
>>> Hi Shay,
>>>
>>> Thanks for your reply! I would very much like to use PySpark. However,
>>> my project depends on GraphX, which is only available in the Scala API
>>> as far as I know. So I'm locked into Scala and trying to find a way
>>> out. I wonder if there's a way to work around it.
>>>
>>> Best regards,
>>> Yuhao Zhang
>>>
>>> On Sun, Jul 10, 2022 at 5:36 AM Shay Elbaz <shay.el...@gm.com> wrote:
>>>
>>>> Yuhao,
>>>>
>>>> You can use PySpark as the entry point to your application. With py4j
>>>> you can call Java/Scala functions from the Python application.
>>>> There's no need to use the pipe() function for that.
>>>>
>>>> Shay
>>>> ------------------------------
>>>> *From:* Yuhao Zhang <yhzhang1...@gmail.com>
>>>> *Sent:* Saturday, July 9, 2022 4:13:42 AM
>>>> *To:* user@spark.apache.org
>>>> *Subject:* [EXTERNAL] RDD.pipe() for binary data
>>>>
>>>> Hi All,
>>>>
>>>> I'm currently working on a project that involves transferring data
>>>> between Spark 3.x (I use Scala) and a Python runtime. In Spark, the
>>>> data is stored in an RDD as arrays/vectors of floating-point numbers,
>>>> and I have custom routines written in Python to process them. On the
>>>> Spark side, I also have some operations specific to the Spark Scala
>>>> APIs, so I need to use both runtimes.
>>>>
>>>> To transfer the data I've been using the RDD.pipe() API:
>>>> 1. Convert the arrays to strings in Spark and call
>>>> RDD.pipe(script.py).
>>>> 2. Python receives the strings, casts them into Python data
>>>> structures, and runs the operations.
>>>> 3. Python converts the arrays back into strings and prints them back
>>>> to Spark.
>>>> 4. Spark receives the strings and casts them back into arrays.
>>>>
>>>> Needless to say, this feels unnatural and slow to me, and there are
>>>> potential floating-point precision issues, as I think the float
>>>> arrays should really be transmitted as raw bytes. I found no way to
>>>> use RDD.pipe() for this purpose; as written in
>>>> https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139,
>>>> pipe() seems to be locked into text-based streaming.
>>>>
>>>> Can anyone shed some light on how I can achieve this? I'm trying to
>>>> find an approach that does not involve modifying core Spark myself.
>>>> One potential solution I can think of is saving/loading the RDD as
>>>> binary files, but I'm hoping for a streaming-based solution. Any help
>>>> is much appreciated, thanks!
>>>>
>>>> Best regards,
>>>> Yuhao