HyukjinKwon opened a new pull request, #36683:
URL: https://github.com/apache/spark/pull/36683
### What changes were proposed in this pull request?
This PR proposes to use `LocalRelation` instead of `LogicalRDD` when
creating a DataFrame with Arrow optimization enabled.
More specifically, we no longer create an RDD; the data is instead passed as a
local relation on the driver side, which is consistent with the Scala code path.
Namely:
```python
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3, 4]})).explain(True)
```
**Before**
```
== Parsed Logical Plan ==
LogicalRDD [a#0L], false
== Analyzed Logical Plan ==
a: bigint
LogicalRDD [a#0L], false
== Optimized Logical Plan ==
LogicalRDD [a#0L], false
== Physical Plan ==
*(1) Scan ExistingRDD arrow[a#0L]
```
**After**
```
== Parsed Logical Plan ==
LocalRelation [a#0L]
== Analyzed Logical Plan ==
a: bigint
LocalRelation [a#0L]
== Optimized Logical Plan ==
LocalRelation [a#0L]
== Physical Plan ==
LocalTableScan [a#0L]
```
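As a quick sanity check, the physical plan can also be inspected programmatically. The following is a hedged sketch that goes through PySpark's internal `_jdf` handle (not a public API), assuming the same `spark` session and Arrow config as in the example above:

```python
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
df = spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3, 4]}))

# Internal API: fetch the executed plan via the underlying Java Dataset.
plan = df._jdf.queryExecution().executedPlan().toString()
assert "LocalTableScan" in plan  # was "Scan ExistingRDD arrow" before this PR
```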
### Why are the changes needed?
We have some nice optimizations for `LocalRelation`. For example, the
statistics are fully known when `LocalRelation` is used, whereas with
`LogicalRDD` many optimizations cannot be applied. In some cases (e.g.,
`executeCollect`), we can avoid creating RDDs altogether.
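For illustration, here is a hedged sketch of how the improved statistics can be observed, again via the internal `_jdf` handle (the exact `Statistics` output may vary by Spark version):

```python
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
df = spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3, 4]}))

# Internal API: with a LocalRelation the optimizer sees an exact
# sizeInBytes instead of an estimate, enabling e.g. broadcast joins.
stats = df._jdf.queryExecution().optimizedPlan().stats().toString()
print(stats)  # e.g. Statistics(sizeInBytes=... B)
```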
### Does this PR introduce _any_ user-facing change?
No, it is an optimization.
### How was this patch tested?
Manually tested. Benchmark is TBD.