Hi, I'm interested in figuring out how the Python API for Spark works. I've come to the following conclusions and want to share them with the community; they could be of use in the PySpark docs here, specifically the "Execution and pipelining" part.
Any sanity checking would be much appreciated. Here's the trivial Python example I've traced:

    from pyspark import SparkContext
    sc = SparkContext("local[1]", "Adam test")
    sc.setCheckpointDir("foo checkpoint dir")

I added this JVM option:

    export IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"

I also added prints in py4j-java/src/py4j/commands/CallCommand.java, specifically in the execute method, then built and replaced the existing class in the Py4J 0.9 jar in my Spark assembly jar. Example output:

    In execute for CallCommand, commandName: c target object id: o0 methodName: get

I launch the Spark application with:

    $SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1

I've quickly put together the following WIP diagram of what I think is happening: http://postimg.org/image/nihylmset/

To summarise, I think:

- We rely heavily on reflection (as evidenced by Py4J's ReflectionEngine and MethodInvoker classes) to invoke Spark's API in a JVM from Python.
- There's an agreed protocol (in Py4J's Protocol.java) for handling commands; these commands are exchanged over a local socket between Python and our JVM (the driver, based on the docs, not the master).
- The Spark API is accessible by means of commands exchanged over that socket using the agreed protocol.
- Commands are read and written using a BufferedReader/BufferedWriter.
- Type conversion is also performed between Python and Java (I haven't looked at this in detail yet).
- We keep track of objects by id, with, for example, o0 representing the first object we know about.

Does this sound correct? I've only checked the trace output in local mode, and I'm curious what happens when we're running in standalone mode. I didn't see a Python interpreter appearing on all the workers to process partitions of data, so I assume that in standalone mode we use Python solely as an orchestrator (the driver) and not as an executor for distributed computing?
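To make the object-registry and reflection points concrete, here's a toy Python sketch of how a call command like the one in my trace (commandName c, target o0, methodName get) could be dispatched. This is NOT Py4J's actual implementation; the class and command framing here are illustrative only, loosely modelled on the fields printed above, and I'm using Python's getattr where the real engine uses Java reflection:

```python
# Toy sketch of Py4J-style call dispatch -- illustrative only, not the
# real Py4J code. Object ids (o0, o1, ...) index a registry of known
# objects; a call command names a target id and a method to invoke.

class ToyGateway:
    """Tracks known objects: o0 is the first object registered, o1 the next."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def register(self, obj):
        object_id = "o%d" % self._next_id
        self._objects[object_id] = obj
        self._next_id += 1
        return object_id

    def execute_call(self, command):
        # Command shape (assumed for illustration):
        #   "c\n<target object id>\n<method name>\n<args...>\ne"
        # The real engine resolves the method via reflection
        # (ReflectionEngine/MethodInvoker); here getattr stands in.
        lines = command.strip().split("\n")
        assert lines[0] == "c", "not a call command"
        target_id, method_name = lines[1], lines[2]
        args = lines[3:-1]  # everything up to the terminating "e"
        method = getattr(self._objects[target_id], method_name)
        return method(*args)

gateway = ToyGateway()
conf_like = {"spark.master": "local[1]"}     # stand-in for a conf object
oid = gateway.register(conf_like)            # -> "o0", first known object
result = gateway.execute_call("c\n%s\nget\nspark.master\ne" % oid)
print(oid, result)                           # prints: o0 local[1]
```

The matching piece in my trace would be the CallCommand execute method pulling the target object out of the registry before invoking the named method on it.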
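The local-socket exchange can be sketched the same way: Python writes a command terminated by "e", the JVM side reads it line by line and writes back a reply. Here the "JVM" is just a thread, and the "ys" reply prefix (success + string value) is my reading of Protocol.java, so treat it as an assumption:

```python
# Sketch of the command/reply exchange over a local socket.
# The thread below plays the JVM side; framing and the "ys" reply
# prefix are illustrative assumptions, not verified Py4J behaviour.
import socket
import threading

def fake_jvm_side(conn):
    with conn.makefile("r") as reader, conn.makefile("w") as writer:
        # Read one command; "e" on its own line terminates it.
        for line in reader:
            if line.rstrip("\n") == "e":
                break
        # Pretend we resolved o0 and invoked get -> reply with a string.
        writer.write("ys" + "local[1]" + "\n")
        writer.flush()

left, right = socket.socketpair()
jvm = threading.Thread(target=fake_jvm_side, args=(right,))
jvm.start()

with left.makefile("r") as reader, left.makefile("w") as writer:
    writer.write("c\no0\nget\nspark.master\ne\n")  # call command
    writer.flush()
    reply = reader.readline().rstrip("\n")

jvm.join()
print(reply)  # prints: yslocal[1]
```

The makefile wrappers here play the role of the BufferedReader/BufferedWriter pair I mentioned above.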
Happy to provide the full trace output on request (I've omitted timestamps and logging info, and added spacing). I expect there's an O*JDK method-tracing equivalent, so the above can easily be reproduced regardless of Java vendor.

Cheers,