Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/19349

The performance test I ran locally, based on @BryanCutler's (https://github.com/apache/spark/pull/18659#issuecomment-315879173), is as follows:

```python
from pyspark.sql.functions import *
from pyspark.sql.types import *

@udf(DoubleType())
def my_udf(p1, p2):
    from math import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

@pandas_udf(DoubleType())
def my_pandas_udf(p1, p2):
    from numpy import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

df = spark.range(1 << 24, numPartitions=16).toDF("id") \
    .withColumn("p1", rand()).withColumn("p2", rand())

df_udf = df.withColumn("p", my_udf(col("p1"), col("p2")))
df_pandas_udf = df.withColumn("p", my_pandas_udf(col("p1"), col("p2")))
```

```
%timeit -n2 df_udf.select(sum(col('p'))).collect()
12.2 s ± 456 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```

```
spark.conf.set("spark.sql.execution.arrow.stream.enable", "false")
%timeit -n2 df_pandas_udf.select(sum(col('p'))).collect()
1.91 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```

```
spark.conf.set("spark.sql.execution.arrow.stream.enable", "true")
%timeit -n2 df_pandas_udf.select(sum(col('p'))).collect()
1.67 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```

Environment:
- Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01 on Mac OS X 10.12.6
- Python 3.6.1 64bit [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
- pandas 0.20.1
- pyarrow 0.4.1
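As a side note (not part of the benchmark above, just an illustrative sketch): both UDFs compute the same quantity, `exp(log(p1) + log(p2) - log(0.5))`, which simplifies to `p1 * p2 / 0.5`. The gap in the timings comes from how the function is evaluated, row at a time for the plain `@udf` versus over whole batches for the `@pandas_udf`. A standalone NumPy comparison of the two evaluation styles:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
p1 = rng.random(1 << 10)
p2 = rng.random(1 << 10)

# Row-at-a-time, the way the plain @udf is evaluated: one Python call per row.
scalar = np.array([math.exp(math.log(a) + math.log(b) - math.log(0.5))
                   for a, b in zip(p1, p2)])

# Vectorized, the way the @pandas_udf is evaluated: one call per batch.
vectorized = np.exp(np.log(p1) + np.log(p2) - np.log(0.5))

# Both agree, and both equal the algebraically simplified form.
assert np.allclose(scalar, vectorized)
assert np.allclose(vectorized, p1 * p2 / 0.5)
```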