Hey,

Unless Python itself were as fast as Scala/Java, I think it's impossible to get
similar performance in PySpark as in Scala/Java. Jython is also much slower
than Scala/Java.

With Jython, we could avoid the cost of managing multiple processes and RPC,
but we would still need to do the data conversion between Java and Python.
Given the fact that Jython is not widely used in production, it may introduce
more trouble than the performance gain is worth.
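
Even without the process/RPC overhead, the conversion alone is not free. Here
is a toy single-process sketch (plain pickle, nothing to do with PySpark or
Jython internals) that shows that per-record cost by itself:

    # Toy micro-benchmark: the same "multiply each row by 2, take the max"
    # job as in the benchmark below, with and without a per-record pickle
    # round-trip standing in for the Java <-> Python data conversion.
    import pickle
    import time

    data = [str(i) for i in range(1000000)]

    start = time.time()
    max(pickle.loads(pickle.dumps(x * 2)) for x in data)
    print("with per-record round-trip: %.1fs" % (time.time() - start))

    start = time.time()
    max(x * 2 for x in data)
    print("no conversion at all: %.1fs" % (time.time() - start))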

Spark jobs can easily be sped up by scaling out (adding more resources; see
the sketch below). I think the biggest advantage of PySpark is that it lets
you prototype quickly. Once your ETL is finalized, it's not that hard to
translate your pure Python jobs into Scala to reduce the cost (and that step
is optional).
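
A minimal sketch of that scaling-out knob (assuming YARN; the config keys are
standard Spark settings, the numbers are made up for illustration):

    from pyspark import SparkConf, SparkContext

    # Same job, just asking the cluster manager for more executors.
    conf = (SparkConf()
            .setAppName("etl-prototype")
            .set("spark.executor.instances", "20")
            .set("spark.executor.cores", "4")
            .set("spark.executor.memory", "4g"))
    sc = SparkContext(conf=conf)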

Nowadays, engineer time is much more expensive than CPU time, so I think we
should focus more on the former.

That's my 2 cents.

Davies

On Thu, Jan 29, 2015 at 12:45 PM, rtshadow
<pastuszka.przemys...@gmail.com> wrote:
> Hi,
>
> In my company, we've been trying to use PySpark to run ETLs on our data.
> Alas, it turned out to be terribly slow compared to the Java or Scala API
> (which we ended up using instead to meet our performance criteria).
>
> To be more quantitative, let's consider a simple case:
> I've generated a test file (848MB): seq 1 100000000 > /tmp/test
>
> and tried to run a simple computation on it, which includes three steps:
> read -> multiply each row by 2 -> take the max
> Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
> Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
>
> Here are the results of this simple benchmark:
> CPython - 59s
> PyPy - 26s
> Scala version - 7s
>
> I didn't dig into what exactly contributes to the execution times of CPython /
> PyPy, but it seems that serialization / deserialization when sending data
> to the worker may be the issue.
> I know some people have already been asking about using Jython
> (http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
> but it seems that no one has really done this with Spark.
> It looks like the performance gain from using Jython could be huge - you
> wouldn't need to spawn PythonWorkers, and all the code would just be executed
> inside the SparkExecutor JVM, using Python code compiled to Java bytecode.
> Do you think that's possible to achieve? Do you see any obvious obstacles?
> Of course, Jython doesn't support C extensions, but if one doesn't need them,
> then it should fit here nicely.
>
> I'm willing to try to marry Spark with Jython and see how it goes.
>
> What do you think about this?
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
