As mentioned in the documentation: PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
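To make my question concrete, below is the toy model I currently have in mind for that description: a parent process spawning a Python worker subprocess and exchanging one length-prefixed, pickled batch with it over its stdin/stdout pipes. The framing, names, and doubling function are my own illustration, not PySpark's actual protocol.

import pickle
import struct
import subprocess
import sys

# Code run in the worker subprocess: read one length-prefixed, pickled batch
# from stdin, apply a function, and write the pickled result back to stdout.
WORKER_CODE = r"""
import pickle, struct, sys
n, = struct.unpack(">i", sys.stdin.buffer.read(4))
batch = pickle.loads(sys.stdin.buffer.read(n))
out = pickle.dumps([x * 2 for x in batch])   # stand-in for the user's function
sys.stdout.buffer.write(struct.pack(">i", len(out)) + out)
sys.stdout.buffer.flush()
"""

def run_worker(batch):
    """Spawn a Python worker and exchange one pickled batch over its pipes."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER_CODE],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    payload = pickle.dumps(batch)
    proc.stdin.write(struct.pack(">i", len(payload)) + payload)
    proc.stdin.close()                       # flush and signal end of input
    n, = struct.unpack(">i", proc.stdout.read(4))
    result = pickle.loads(proc.stdout.read(n))
    proc.wait()
    return result

print(run_worker([1, 2, 3]))                 # -> [2, 4, 6]

As I understand it, the real code does something morally similar, but with its own serialization and long-lived workers, and that is exactly the part I would like to trace.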
I am trying to understand the implementation of how this data transfer happens using pipes. Can anyone please guide me along those lines?

Thanks,
Amit Rana

On 7 Jul 2016 13:44, "Sun Rui" <sunrise_...@163.com> wrote:

> You can read
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> For PySpark data flow on worker nodes, you can read the source code of
> PythonRDD.scala. Python worker processes communicate with Spark executors
> via sockets instead of pipes.
>
> On Jul 7, 2016, at 15:49, Amit Rana <amitranavs...@gmail.com> wrote:
>
> Hi all,
>
> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA
> on Windows 7. I submitted a Python job as follows:
> --master local[4] <path to pyspark job> <arguments to the job>
>
> I have made the following observations after running the above command in
> debug mode:
> -> Locally, when the PySpark interpreter starts, it also starts a JVM with
> which it communicates through a socket.
> -> Py4J is used to handle this communication.
> -> This JVM acts as the actual Spark driver and loads a JavaSparkContext,
> which communicates with the Spark executors in the cluster.
>
> I have read that, in the cluster, the data flow between the Spark executors
> and the Python interpreter happens using pipes, but I am not able to trace
> that data flow.
>
> Please correct me if my understanding is wrong. It would be very helpful if
> someone could help me understand the code flow for data transfer between the
> JVM and the Python workers.
>
> Thanks,
> Amit Rana
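Based on Sun Rui's pointer that the Python workers talk to the executors over sockets rather than pipes, here is the minimal sketch I am now using as a mental model: a stand-in "executor" that accepts a connection from a stand-in worker and exchanges one pickled, length-prefixed batch with it. The names, framing, and function are mine for illustration only; the real protocol is in PythonRDD.scala and python/pyspark/worker.py.

import pickle
import socket
import struct
import threading

def recv_exact(conn, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed early")
        buf += chunk
    return buf

def executor_side(server, batch):
    """Stand-in for the JVM executor: accept the worker, stream one batch, read the result."""
    conn, _ = server.accept()
    with conn:
        payload = pickle.dumps(batch)
        conn.sendall(struct.pack(">i", len(payload)) + payload)
        n, = struct.unpack(">i", recv_exact(conn, 4))
        return pickle.loads(recv_exact(conn, n))

def worker_side(port):
    """Stand-in for the Python worker: connect, apply the function, send the result back."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        n, = struct.unpack(">i", recv_exact(conn, 4))
        batch = pickle.loads(recv_exact(conn, n))
        out = pickle.dumps([x * 2 for x in batch])   # stand-in for the user's function
        conn.sendall(struct.pack(">i", len(out)) + out)

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=worker_side, args=(server.getsockname()[1],)).start()
print(executor_side(server, [1, 2, 3]))      # -> [2, 4, 6]
server.close()

Corrections to this mental model would be very welcome.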