It is something like this: https://issues.apache.org/jira/browse/SPARK-5097

On the master branch, we have a pandas-like API already.
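For the curious, here is a minimal sketch of what that looks like from Python. It assumes an existing SparkContext named sc and a hypothetical input file /tmp/people.json; the method names follow the SPARK-5097 design and may differ slightly on master:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc: an existing SparkContext
    df = sqlContext.jsonFile("/tmp/people.json")  # hypothetical input path

    # Filter, projection, and aggregation are planned by Catalyst and run
    # inside the JVM, so no per-row data is shipped to Python workers.
    result = (df.filter(df["age"] > 21)
                .select("name", "age")
                .groupBy("name")
                .agg({"age": "max"}))
    result.show()

Because the query plan executes in the JVM, Python pays no per-record serialization cost for anything expressible in this API.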
On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski <skacan...@gmail.com> wrote:

> Hi Reynold,
> In my project I want to use the Python API too.
> When you mention DFs, are we talking about pandas, or is this something
> internal to the Spark Python API?
> If you could elaborate a bit on this or point me to alternate
> documentation.
> Thanks much --sasha
>
> On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Once the DataFrame API is released for 1.3, you can write your thing in
>> Python and get the same performance. It can't express everything, but
>> for basic things like projection, filter, join, aggregate, and simple
>> numeric computation, it should work pretty well.
>>
>> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow
>> <pastuszka.przemys...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In my company, we've been trying to use PySpark to run ETLs on our
>>> data. Alas, it turned out to be terribly slow compared to the Java or
>>> Scala API (which we ended up using to meet performance criteria).
>>>
>>> To be more quantitative, let's consider a simple case:
>>> I've generated a test file (848 MB): seq 1 100000000 > /tmp/test
>>>
>>> and tried to run a simple computation on it, which includes three
>>> steps: read -> multiply each row by 2 -> take max
>>> Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
>>> Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
>>>
>>> Here are the results of this simple benchmark:
>>> CPython - 59s
>>> PyPy - 26s
>>> Scala version - 7s
>>>
>>> I didn't dig into what exactly contributes to the execution times of
>>> CPython / PyPy, but it seems that serialization / deserialization when
>>> sending data to the workers may be the issue.
>>> I know some people have already been asking about using Jython
>>> (http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
>>> but it seems that no one has really done this with Spark.
>>> It looks like the performance gain from using Jython could be huge -
>>> you wouldn't need to spawn PythonWorkers; all the code would just be
>>> executed inside the SparkExecutor JVM, using Python code compiled to
>>> Java bytecode. Do you think that's possible to achieve? Do you see any
>>> obvious obstacles? Of course, Jython doesn't have C extensions, but if
>>> one doesn't need them, then it should fit nicely here.
>>>
>>> I'm willing to try to marry Spark with Jython and see how it goes.
>>>
>>> What do you think about this?
>
> --
> Aleksandar Kacanski