Re: [DISCUSS] PySpark Window UDF

2018-09-20 Thread Felix Cheung
Definitely!
The numba numbers are amazing.


From: Wes McKinney 
Sent: Saturday, September 8, 2018 7:46 AM
To: Li Jin
Cc: dev@spark.apache.org
Subject: Re: [DISCUSS] PySpark Window UDF

hi Li,

These results are very cool. I'm excited to see you continuing to push
this effort forward.

- Wes




Re: [DISCUSS] PySpark Window UDF

2018-09-08 Thread Wes McKinney
hi Li,

These results are very cool. I'm excited to see you continuing to push
this effort forward.

- Wes




Re: [DISCUSS] PySpark Window UDF

2018-09-05 Thread Li Jin
Hello again!

I recently built a proof-of-concept implementation of the proposal above.
I think the results are pretty exciting, so I want to share my findings
with the community. I have implemented two variants of the pandas window
UDF: one that takes a pandas.Series as input and one that takes a numpy
array as input. I benchmarked a rolling mean over 1M doubles; here are
the results:

Spark SQL window function: 20s
Pandas variant: ~60s
Numpy variant: 10s
Numpy variant with numba: 4s

You can see the benchmark code here:
https://gist.github.com/icexelloss/845beb3d0d6bfc3d51b3c7419edf0dcb
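
For intuition only (this is not the benchmark code itself; see the gist
above), the difference between the two variants boils down to whether each
window is wrapped in a pandas.Series or handled as a raw numpy slice. A
minimal local sketch:

    import numpy as np
    import pandas as pd

    values = np.random.rand(1_000_000)  # 1M doubles
    w = 100  # illustrative window size

    # pandas variant: wrapping each window in a pandas.Series adds
    # per-window object-construction overhead
    def rolling_mean_pandas(v, w):
        return [pd.Series(v[i:i + w]).mean() for i in range(len(v) - w + 1)]

    # numpy variant: each window is just a raw slice of the array
    def rolling_mean_numpy(v, w):
        return [v[i:i + w].mean() for i in range(len(v) - w + 1)]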

I think the results are quite exciting because:
(1) the numpy variant outperforms even the Spark SQL window function
(2) the numpy variant with numba has the best performance while still
giving users the flexibility to write window functions in pure Python
(a sketch of what that looks like follows below)
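
To make (2) concrete, here is a sketch of the kind of pure-Python window
function numba can compile (illustrative only; it assumes the begin/end
index scheme from the proposal, not the final API):

    import numpy as np
    from numba import jit

    # A rolling mean written as a plain Python loop over the begin/end
    # indices of each window; numba compiles the loop to machine code.
    @jit(nopython=True)
    def rolling_mean(values, begin, end):
        out = np.empty(len(begin))
        for i in range(len(begin)):
            out[i] = np.mean(values[begin[i]:end[i]])
        return out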

The pandas variant is not bad either (1.5x faster than the existing UDF
approach with collect_list), but the numpy variant performs much better.
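
For reference, the existing collect_list workaround looks roughly like this
(the column names are made up for illustration): every window is first
materialized as an array column, so all of its rows get copied into Python:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Collect each window into an array column, then run an ordinary
    # UDF over the arrays.
    win = Window.partitionBy("id").orderBy("ts").rowsBetween(-99, 0)
    mean_udf = F.udf(lambda xs: float(sum(xs)) / len(xs), "double")
    df = (df.withColumn("window_vals", F.collect_list("v").over(win))
            .withColumn("rolling_mean", mean_udf("window_vals")))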

So far all pandas UDFs interact with pandas data structures rather than
numpy data structures, but these window UDF results might be a good
reason to open up numpy variants of pandas UDFs. What do people think?
I'd love to hear the community's feedback.


Links:
You can reproduce the benchmark with the numpy variant using this branch:
https://github.com/icexelloss/spark/tree/window-udf-numpy

PR link:
https://github.com/apache/spark/pull/22305



[DISCUSS] PySpark Window UDF

2018-05-16 Thread Li Jin
Hi All,

I have been looking into leveraging the Arrow and pandas UDF work we have
done so far for window UDFs in PySpark. I have done some investigation and
believe there is a way to do PySpark window UDFs efficiently.

The basic idea is that instead of passing each window to Python separately,
we can pass a "batch of windows" as an Arrow batch of rows plus begin/end
indices for each window (the indices are computed on the Java side), and
then roll over the begin/end indices in Python, applying the UDF to each
window.
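
As a rough sketch of the Python side (the names here are illustrative, not
the actual implementation), the worker would receive one Arrow batch plus
the per-window indices and slice locally:

    import pyarrow as pa

    # Receive one Arrow batch of rows plus begin/end indices for every
    # window in the batch; slice and apply the UDF per window in Python
    # instead of shipping each window across the JVM/Python boundary.
    def eval_window_udf(batch: pa.RecordBatch, begin, end, udf):
        series = batch.column(0).to_pandas()
        return [udf(series.iloc[b:e]) for b, e in zip(begin, end)]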

I have written up my investigation in more detail here:
https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#

I think this is pretty promising and hope to get some feedback from the
community about this approach. Let's discuss! :)

Li