Hi,

At my company, we've been trying to use PySpark to run ETL jobs on our data.
Alas, it turned out to be terribly slow compared to the Java or Scala API (which
we ended up using to meet our performance criteria).

To be more quantitative, let's consider a simple case.
I generated a test file (848MB): seq 1 100000000 > /tmp/test

and tried to run a simple computation on it, which consists of three steps: read
-> multiply each row by 2 -> take the max
Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
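
One detail worth keeping in mind when reading these snippets: textFile yields each line as a string, so x * 2 repeats the text rather than doubling a number, and max() then compares strings. The sketch below shows this in plain Python without Spark. The `lines` list is a small stand-in for the file contents, and the numeric variant is only an assumption about what a numeric version of the benchmark might look like, not the code that was timed.

```python
# Stand-in for what sc.textFile would yield: every element is a str, not an int.
lines = [str(n) for n in range(1, 11)]

# Same lambda as in the benchmark: on a str, "* 2" repeats the text ("7" -> "77").
repeated_max = max(map(lambda x: x * 2, lines))

# Hypothetical numeric variant: parse each line first, then double it.
numeric_max = max(map(lambda x: int(x) * 2, lines))

print(repeated_max)  # string comparison picks "99", not "1010"
print(numeric_max)   # 20
```

Either variant still exercises the same read -> map -> max pipeline, so the relative timings remain a fair comparison as long as both languages run the same variant.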

Here are the results of this simple benchmark:
CPython - 59s
PyPy - 26s
Scala version - 7s

I didn't dig into what exactly contributes to the execution times of CPython /
PyPy, but it seems that serialization / deserialization when sending data
to the worker may be the issue.
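
One rough way to test that suspicion outside Spark is to compare the cost of the per-row work with the cost of a serialization round trip over the same rows. The sketch below is only an approximation under stated assumptions: it uses the standard pickle module as a stand-in for whatever the worker protocol does, the `time_it` helper and the row count are made up for illustration, and it ignores socket transfer and process startup entirely.

```python
import pickle
import time


def time_it(fn):
    """Run fn once and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Small stand-in for the benchmark file: one string per row.
rows = [str(n) for n in range(1, 200001)]

# Cost of the actual per-row work (same operation as the benchmark lambda).
_, compute_s = time_it(lambda: [r * 2 for r in rows])


def round_trip():
    # Serialize and deserialize the whole batch, as a crude proxy for
    # shipping rows between the JVM and a Python worker.
    payload = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(payload)


restored, serde_s = time_it(round_trip)

print("compute: %.4fs, pickle round trip: %.4fs" % (compute_s, serde_s))
```

If the round trip takes a similar or larger share of the time than the computation itself, that would be consistent with serialization being a significant part of the gap, though profiling a real PySpark job would be needed to confirm it.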
I know some people have already asked about using Jython
(http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
but it seems that no one has really done this with Spark.
It looks like the performance gain from using Jython could be huge: you wouldn't
need to spawn PythonWorkers, and all the code would just be executed inside
the SparkExecutor JVM, using Python code compiled to Java bytecode. Do you think
that's possible to achieve? Do you see any obvious obstacles? Of course,
Jython doesn't support C extensions, but if one doesn't need them, then it
should fit here nicely.

I'm willing to try to marry Spark with Jython and see how it goes.

What do you think about this?

--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
