I'm curious about using shared memory to speed up the JVM->Python round trip. Is there any sane way to do anonymous shared memory in Java/Scala?
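For what it's worth, a minimal sketch of what the Python half of such a handoff could look like, using a memory-mapped file rather than truly anonymous shared memory (on Linux, a file under /dev/shm is the closest portable stand-in; on the JVM side this would be a FileChannel.map / MappedByteBuffer). The path, values, and the assumption that the JVM writes little-endian float64s are all illustrative:

```python
import mmap
import os
import struct
import tempfile

# Simulate the JVM writer: pack float64 values into a file. In Scala this
# would be a MappedByteBuffer; note that JVM ByteBuffers default to
# big-endian, so the writer must call order(ByteOrder.LITTLE_ENDIAN) to
# match the "<" format used here.
values = [1.5, -2.25, 3.125]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"<{len(values)}d", *values))
    path = f.name

# Python side: memory-map the same file and view the bytes as doubles,
# with no text encoding/decoding and no copy through stdin/stdout.
with open(path, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    roundtripped = list(struct.unpack(f"<{len(values)}d", buf[:]))
    buf.close()

os.unlink(path)
print(roundtripped)  # exact float64 round trip, no precision loss
```

This only sketches the data path; coordinating who writes when (and cleanup) would still need a handshake between the two processes.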
On Sat, Jul 16, 2022 at 16:10 Sebastian Piu <sebastian....@gmail.com> wrote:

> Other alternatives are to look at how PythonRDD does it in Spark. You
> could also go for a more traditional setup where you expose your Python
> functions behind a local/remote service and call that from Scala, say
> over Thrift/gRPC/HTTP/local socket, etc.
> Another option, though I've never done it so I'm not sure if it would
> work, is to look at whether Arrow could help by sharing a piece of
> memory holding the data you need from Scala and then reading it from
> Python.
>
> On Sat, Jul 16, 2022 at 9:56 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Use GraphFrames?
>>
>> On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang <yhzhang1...@gmail.com> wrote:
>>
>>> Hi Shay,
>>>
>>> Thanks for your reply! I would very much like to use PySpark. However,
>>> my project depends on GraphX, which is only available in the Scala API
>>> as far as I know. So I'm locked into Scala and trying to find a way
>>> out. I wonder if there's a way to work around it.
>>>
>>> Best regards,
>>> Yuhao Zhang
>>>
>>> On Sun, Jul 10, 2022 at 5:36 AM Shay Elbaz <shay.el...@gm.com> wrote:
>>>
>>>> Yuhao,
>>>>
>>>> You can use PySpark as the entry point to your application. With py4j
>>>> you can call Java/Scala functions from the Python application.
>>>> There's no need to use the pipe() function for that.
>>>>
>>>> Shay
>>>> ------------------------------
>>>> *From:* Yuhao Zhang <yhzhang1...@gmail.com>
>>>> *Sent:* Saturday, July 9, 2022 4:13:42 AM
>>>> *To:* user@spark.apache.org
>>>> *Subject:* [EXTERNAL] RDD.pipe() for binary data
>>>>
>>>> Hi All,
>>>>
>>>> I'm currently working on a project that involves transferring data
>>>> between Spark 3.x (I use Scala) and a Python runtime. In Spark, the
>>>> data is stored in an RDD as arrays/vectors of floating-point numbers,
>>>> and I have custom routines written in Python to process them. On the
>>>> Spark side, I also have some operations specific to the Spark Scala
>>>> APIs, so I need to use both runtimes.
>>>>
>>>> To transfer the data I've been using the RDD.pipe() API:
>>>> 1. Convert the arrays to strings in Spark and call
>>>> RDD.pipe(script.py).
>>>> 2. Python receives the strings, casts them into Python data
>>>> structures, and runs the operations.
>>>> 3. Python converts the arrays back into strings and prints them back
>>>> to Spark.
>>>> 4. Spark receives the strings and casts them back into arrays.
>>>>
>>>> Needless to say, this feels unnatural and slow to me, and there are
>>>> potential floating-point precision issues, as I think the float
>>>> arrays should really be transmitted as raw bytes. I found no way to
>>>> use RDD.pipe() for this purpose; as written in
>>>> https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139,
>>>> pipe() seems to be locked into text-based streaming.
>>>>
>>>> Can anyone shed some light on how I can achieve this? I'm trying to
>>>> find an approach that does not involve modifying core Spark myself.
>>>> One potential solution I can think of is saving/loading the RDD as
>>>> binary files, but I'm hoping for a streaming-based solution. Any help
>>>> is much appreciated, thanks!
>>>>
>>>> Best regards,
>>>> Yuhao