Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21546
  
    ### Memory Improvements
    
    **toPandas()**
    
    The most significant improvement is a reduction of the upper-bound space 
complexity in the JVM driver.  Before, the entire dataset was collected in the 
JVM before being sent to Python.  With this change, as soon as a partition is 
collected, the result handler immediately sends it to Python, so the upper 
bound is the size of the largest partition.  Using the Arrow stream format is 
also more efficient because the schema is written once per stream, followed by 
the record batches.  The schema is now sent only once, from the driver JVM to 
Python.  Before, multiple Arrow file formats were used, each containing a copy 
of the schema.  These duplicated schemas were created in the executors, sent to 
the driver JVM, and then on to Python, where all but the first one received 
were discarded.
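    
    To illustrate, here is a minimal pyarrow sketch (illustrative only, not 
the PR's internal code path) of how the stream format writes the schema once 
up front, followed by the record batches:
    ```python
    import io
    
    import pyarrow as pa
    
    # Two record batches sharing one schema, like two collected partitions
    schema = pa.schema([("id", pa.int64()), ("x1", pa.float64())])
    batches = [
        pa.record_batch([pa.array([i, i + 1]), pa.array([0.1, 0.2])], schema=schema)
        for i in range(0, 4, 2)
    ]
    
    # Stream format: one schema message, then N record batch messages
    sink = io.BytesIO()
    with pa.ipc.new_stream(sink, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)
    
    # The reader receives the schema once, then iterates over the batches
    reader = pa.ipc.open_stream(sink.getvalue())
    assert reader.schema.equals(schema)
    print(reader.read_all().num_rows)  # 4
    ```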
    
    I verified the upper-bound limit by running a test that collects more data 
than the driver JVM has memory available, using the following settings on a 
standalone cluster:
    ```
    spark.driver.memory              1g
    spark.executor.memory            5g
    spark.sql.execution.arrow.enabled  true
    spark.sql.execution.arrow.fallback.enabled  false
    spark.sql.execution.arrow.maxRecordsPerBatch 0
    spark.driver.maxResultSize 2g
    ```
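    
    As a sanity check in the pyspark shell (a sketch; keys as configured 
above), the session values can be read back before running the test:
    ```python
    # Confirm the Arrow-related settings are active in this session
    for key in [
        "spark.sql.execution.arrow.enabled",
        "spark.sql.execution.arrow.fallback.enabled",
        "spark.sql.execution.arrow.maxRecordsPerBatch",
        "spark.driver.maxResultSize",
    ]:
        print(key, "=", spark.conf.get(key))
    ```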
    
    Test code:
    ```python
    from pyspark.sql.functions import rand
    df = (spark.range(1 << 25, numPartitions=32).toDF("id")
          .withColumn("x1", rand())
          .withColumn("x2", rand())
          .withColumn("x3", rand()))
    df.toPandas()
    ```
    
    This makes the total data size 33554432 rows × 4 columns × 8 bytes = 
1073741824 bytes (1 GiB), equal to the full 1g driver heap.
    
    With the current master, it fails with OOM but passes using this PR.
    
    
    **createDataFrame()**
    
    No significant change in memory, except that using the stream format 
instead of separate file formats avoids duplicating the schema, similar to 
toPandas() above.  The process of reading the stream and parallelizing the 
batches does cause the record batch message metadata to be copied, but its 
size is insignificant.
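    
    For reference, here is a minimal round-trip exercising this 
createDataFrame() path (a sketch; assumes the Arrow settings above are in 
effect in the session):
    ```python
    import numpy as np
    import pandas as pd
    
    # pandas -> Arrow record batches -> JVM, via the stream format
    pdf = pd.DataFrame({
        "id": np.arange(1 << 20, dtype=np.int64),
        "x1": np.random.rand(1 << 20),
    })
    df = spark.createDataFrame(pdf)  # Arrow path, since arrow.enabled is true
    assert df.count() == len(pdf)
    ```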

