Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19349
The performance test I ran locally, based on @BryanCutler's benchmark
(https://github.com/apache/spark/pull/18659#issuecomment-315879173), is as
follows:
```python
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Row-at-a-time UDF: invoked once per row with scalar arguments.
@udf(DoubleType())
def my_udf(p1, p2):
    from math import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

# Vectorized UDF: invoked with pandas Series batches, using numpy's
# vectorized log/exp.
@pandas_udf(DoubleType())
def my_pandas_udf(p1, p2):
    from numpy import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

df = spark.range(1 << 24, numPartitions=16).toDF("id") \
    .withColumn("p1", rand()).withColumn("p2", rand())
df_udf = df.withColumn("p", my_udf(col("p1"), col("p2")))
df_pandas_udf = df.withColumn("p", my_pandas_udf(col("p1"), col("p2")))
```
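For context, the speedup below comes from `my_pandas_udf` receiving whole pandas Series per batch, so numpy's `log`/`exp` run vectorized instead of once per row. A minimal standalone sketch of that vectorized evaluation (the sample values are made up for illustration):
```python
import pandas as pd
from numpy import log, exp

# Hypothetical sample inputs, standing in for the Series batches
# a pandas_udf receives.
p1 = pd.Series([0.2, 0.5, 0.9])
p2 = pd.Series([0.3, 0.6, 0.1])

# Same formula as my_pandas_udf above, evaluated over whole Series at once.
print(exp(log(p1) + log(p2) - log(0.5)))
```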
```
%timeit -n2 df_udf.select(sum(col('p'))).collect()
12.2 s ± 456 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
```
spark.conf.set("spark.sql.execution.arrow.stream.enable", "false")
%timeit -n2 df_pandas_udf.select(sum(col('p'))).collect()
1.91 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
```
spark.conf.set("spark.sql.execution.arrow.stream.enable", "true")
%timeit -n2 df_pandas_udf.select(sum(col('p'))).collect()
1.67 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
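From these numbers, the pandas_udf path is roughly 6.4x faster than the row-at-a-time udf (12.2 s / 1.91 s), and enabling the Arrow stream format trims a further ~13% off the pandas_udf time (1.91 s → 1.67 s), for about 7.3x overall.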
Environment:
- Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01 on Mac OS X 10.12.6
- Python 3.6.1 64bit [GCC 4.2.1 Compatible Apple LLVM 6.1.0
(clang-602.0.53)]
- pandas 0.20.1
- pyarrow 0.4.1