Hi all,

I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA on Windows 7, and I submitted a Python job as follows:

    --master local[4] <path to pyspark job> <arguments to the job>
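To see which pieces are involved on the driver side, I have also been poking at the SparkContext internals from a plain Python shell, roughly like this (note: _gateway, _jvm and _jsc are private attributes of SparkContext, and the app name below is arbitrary, so this is only a rough sketch for inspection, not something to rely on):

    from pyspark import SparkContext

    # mirrors the "--master local[4]" part of the job above; app name is arbitrary
    sc = SparkContext(master="local[4]", appName="trace-py4j")

    # internal attributes, used here only to look at the driver-side link
    print(sc._gateway)   # py4j JavaGateway -> socket link to the launched JVM
    print(sc._jvm)       # JVM view used to call into Spark's Java/Scala classes
    print(sc._jsc)       # the JavaSparkContext living inside that JVM

    sc.stop()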
After running the job in debug mode, I have made the following observations:

-> When the PySpark interpreter starts locally, it also launches a JVM and communicates with it over a socket.
-> Py4J is used to handle this communication.
-> This JVM acts as the actual Spark driver and loads a JavaSparkContext, which communicates with the Spark executors in the cluster.

I have read that in the cluster, the data flow between the Spark executors and the Python interpreters happens over pipes, but I am not able to trace that data flow. Please correct me if my understanding is wrong. It would be very helpful if someone could help me understand the code flow for data transfer between the JVM and the Python workers.

Thanks,
Amit Rana