HyukjinKwon opened a new pull request, #36683:
URL: https://github.com/apache/spark/pull/36683
### What changes were proposed in this pull request?
This PR proposes to use `LocalRelation` instead of `LogicalRDD` when
creating a DataFrame with Arrow optimization enabled.
More specifically, we no longer create an RDD; the data is instead passed as a
local relation on the driver side, which is consistent with the Scala code path.
Namely:
```python
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3, 4]})).explain(True)
```
**Before**
```
== Parsed Logical Plan ==
LogicalRDD [a#0L], false
== Analyzed Logical Plan ==
a: bigint
LogicalRDD [a#0L], false
== Optimized Logical Plan ==
LogicalRDD [a#0L], false
== Physical Plan ==
*(1) Scan ExistingRDD arrow[a#0L]
```
**After**
```
== Parsed Logical Plan ==
LocalRelation [a#0L]
== Analyzed Logical Plan ==
a: bigint
LocalRelation [a#0L]
== Optimized Logical Plan ==
LocalRelation [a#0L]
== Physical Plan ==
LocalTableScan [a#0L]
```
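As a quick sanity check, the physical plan can also be inspected programmatically. The following is a hedged sketch that goes through PySpark's internal `_jdf` handle (not a public API), assuming the same `spark` session and Arrow config as in the example above:

```python
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
df = spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3, 4]}))

# Internal API: fetch the executed plan via the underlying Java Dataset.
plan = df._jdf.queryExecution().executedPlan().toString()
assert "LocalTableScan" in plan  # was "Scan ExistingRDD arrow" before this PR
```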
### Why are the changes needed?
We have some nice optimizations for `LocalRelation`. For example, the
statistics are fully known when `LocalRelation` is used, whereas with
`LogicalRDD` many optimizations cannot be applied. In some cases (e.g.,
`executeCollect`), we can avoid creating RDDs altogether.
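For illustration, here is a hedged sketch of how the improved statistics can be observed, again via the internal `_jdf` handle (the exact `Statistics` output may vary by Spark version):

```python
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
df = spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3, 4]}))

# Internal API: with a LocalRelation the optimizer sees an exact
# sizeInBytes instead of an estimate, enabling e.g. broadcast joins.
stats = df._jdf.queryExecution().optimizedPlan().stats().toString()
print(stats)  # e.g. Statistics(sizeInBytes=... B)
```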
### Does this PR introduce _any_ user-facing change?
No, it is an optimization.
### How was this patch tested?
Manually tested. Benchmark is TBD.