Hi all,

I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA on Windows 7, and I submitted a Python job as follows:

    --master local[4] <path to pyspark job> <arguments to the job>
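To see which pieces are involved on the driver side, I have also been poking at the SparkContext internals from a plain Python shell, roughly like this (note: _gateway, _jvm and _jsc are private attributes of SparkContext, and the app name below is arbitrary, so this is only a rough sketch for inspection, not something to rely on):

    from pyspark import SparkContext

    # mirrors the "--master local[4]" part of the job above; app name is arbitrary
    sc = SparkContext(master="local[4]", appName="trace-py4j")

    # internal attributes, used here only to look at the driver-side link
    print(sc._gateway)   # py4j JavaGateway -> socket link to the launched JVM
    print(sc._jvm)       # JVM view used to call into Spark's Java/Scala classes
    print(sc._jsc)       # the JavaSparkContext living inside that JVM

    sc.stop()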
After running the job in debug mode, I have made the following observations:

-> When the PySpark interpreter starts locally, it also launches a JVM and communicates with it over a socket.
-> Py4J is used to handle this communication.
-> This JVM acts as the actual Spark driver and loads a JavaSparkContext, which communicates with the Spark executors in the cluster.

I have read that in the cluster, the data flow between the Spark executors and the Python interpreters happens over pipes, but I am not able to trace that data flow. Please correct me if my understanding is wrong. It would be very helpful if someone could help me understand the code flow for data transfer between the JVM and the Python workers.

Thanks,
Amit Rana