Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/19349
The performance test I ran locally, based on @BryanCutler's benchmark
(https://github.com/apache/spark/pull/18659#issuecomment-315879173), is as
follows:
```python
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Row-at-a-time UDF: invoked once per row with scalar arguments.
@udf(DoubleType())
def my_udf(p1, p2):
    from math import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

# Vectorized UDF: invoked with pandas Series batches, using numpy's
# vectorized log/exp.
@pandas_udf(DoubleType())
def my_pandas_udf(p1, p2):
    from numpy import log, exp
    return exp(log(p1) + log(p2) - log(0.5))

df = spark.range(1 << 24, numPartitions=16).toDF("id") \
    .withColumn("p1", rand()).withColumn("p2", rand())
df_udf = df.withColumn("p", my_udf(col("p1"), col("p2")))
df_pandas_udf = df.withColumn("p", my_pandas_udf(col("p1"), col("p2")))
```
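For context, the speedup below comes from `my_pandas_udf` receiving whole pandas Series per batch, so numpy's `log`/`exp` run vectorized instead of once per row. A minimal standalone sketch of that vectorized evaluation (the sample values are made up for illustration):
```python
import pandas as pd
from numpy import log, exp

# Hypothetical sample inputs, standing in for the Series batches
# a pandas_udf receives.
p1 = pd.Series([0.2, 0.5, 0.9])
p2 = pd.Series([0.3, 0.6, 0.1])

# Same formula as my_pandas_udf above, evaluated over whole Series at once.
print(exp(log(p1) + log(p2) - log(0.5)))
```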
```
%timeit -n2 df_udf.select(sum(col('p'))).collect()
12.2 s ± 456 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
```
spark.conf.set("spark.sql.execution.arrow.stream.enable", "false")
%timeit -n2 df_pandas_udf.select(sum(col('p'))).collect()
1.91 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
```
spark.conf.set("spark.sql.execution.arrow.stream.enable", "true")
%timeit -n2 df_pandas_udf.select(sum(col('p'))).collect()
1.67 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
```
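From these numbers, the pandas_udf path is roughly 6.4x faster than the row-at-a-time udf (12.2 s / 1.91 s), and enabling the Arrow stream format trims a further ~13% off the pandas_udf time (1.91 s → 1.67 s), for about 7.3x overall.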
Environment:
- Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01 on Mac OS X 10.12.6
- Python 3.6.1 64bit [GCC 4.2.1 Compatible Apple LLVM 6.1.0
(clang-602.0.53)]
- pandas 0.20.1
- pyarrow 0.4.1