Hi,

In my company, we've been trying to use PySpark to run ETLs on our data. Alas, it turned out to be terribly slow compared to the Java or Scala API (which we ended up using to meet our performance criteria).
To be more quantitative, let's consider a simple case. I generated a test file (848MB):

    seq 1 100000000 > /tmp/test

and ran a simple computation on it in three steps: read -> multiply each row by 2 -> take max.

Code in Python:

    sc.textFile("/tmp/test").map(lambda x: x * 2).max()

Code in Scala:

    sc.textFile("/tmp/test").map(x => x * 2).max()

Here are the results of this simple benchmark:

    CPython: 59s
    PyPy:    26s
    Scala:   7s

I didn't dig into what exactly contributes to the execution times of CPython / PyPy, but it seems that serialization / deserialization when sending data to the Python worker may be the issue.

I know some people have already asked about using Jython (http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658, http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html), but it seems that no one has really done this with Spark.

It looks like the performance gain from using Jython could be huge: you wouldn't need to spawn PythonWorkers, and all the code would be executed inside the SparkExecutor JVM, as Python code compiled to Java bytecode. Do you think that's possible to achieve? Do you see any obvious obstacles? Of course, Jython doesn't support C extensions, but if one doesn't need them, it should fit here nicely.

I'm willing to try to marry Spark with Jython and see how it goes. What do you think about this?
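
A side note on the benchmark above: textFile yields each line as a string, so as far as I can tell `x * 2` repeats the string rather than doubling the number, in both the Python and the Scala snippet, and max() then compares strings lexicographically. The timings are still comparable since both versions do the same work, but it is not a numeric multiply. Below is a minimal pure-Python sketch of the difference (no Spark needed; the sample lines are made up for illustration, and the int() parsing is my assumption about what a numeric variant would look like):

```python
# Lines as the lambda would receive them from textFile: plain strings.
# (Made-up sample data, standing in for the output of `seq`.)
lines = ["1", "2", "10", "9"]

# What `lambda x: x * 2` does to a string: repetition, not arithmetic.
repeated = [x * 2 for x in lines]

# max() over strings is lexicographic, so "99" beats "1010".
string_max = max(repeated)

# A numeric variant (assumption): parse each line first, then double it.
doubled = [int(x) * 2 for x in lines]
numeric_max = max(doubled)

print(repeated)
print(string_max)
print(doubled)
print(numeric_max)
```

In Spark terms the numeric variant would correspond to something like map(lambda x: int(x) * 2), which adds parsing cost on both sides and may shift the numbers somewhat.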