Dustin Smith created SPARK-54593:
------------------------------------
Summary: Expand DPP for LogicalRelation and LogicalRDD
Key: SPARK-54593
URL: https://issues.apache.org/jira/browse/SPARK-54593
Project: Spark
Issue Type: Improvement
Components: Optimizer, SQL
Affects Versions: 4.0.0, 3.0.0
Reporter: Dustin Smith
Enable Dynamic Partition Pruning (DPP) for queries in which a LocalRelation or
LogicalRDD is the filtering side of a broadcast join. These logical plan nodes
typically represent small, already materialized datasets, which makes them
good candidates for DPP.
# PartitionPruning.scala
** Modified hasSelectivePredicate() so that a filtering side containing a
LocalRelation or LogicalRDD counts as having a selective predicate
** Added overhead calculation for LocalRelation and LogicalRDD in
calculatePlanOverhead()
** LogicalRDD is considered for DPP only when its originStats include a
rowCount, which indicates materialized data
# Test Coverage
** Added five tests in DynamicPartitionPruningSuite.scala:
*** DPP with LocalRelation in broadcast join
*** DPP with LogicalRDD from cached DataFrame in broadcast join
*** DPP with empty LocalRelation
*** DPP should not trigger for LogicalRDD without originStats
*** DPP with large LocalRelation
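
The eligibility rule described under PartitionPruning.scala can be sketched roughly as follows. This is a toy model, not the proposed patch: the plan classes below are simplified stand-ins defined only for this example (they are not the Catalyst classes of the same name), the `selective` flag and `originRowCount` field are hypothetical simplifications, and the sketch assumes the existing check looks for a selective Filter somewhere in the filtering-side plan. The overhead calculation and the empty and large LocalRelation cases are not modelled.

```scala
object DppSketch {
  // Toy stand-ins for logical plan nodes; NOT the real Spark classes.
  sealed trait Plan { def children: Seq[Plan] }
  case class Filter(selective: Boolean, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }
  case class LocalRelation(rowCount: Long) extends Plan {
    def children: Seq[Plan] = Nil
  }
  // originRowCount models "originStats with rowCount" from the description.
  case class LogicalRDD(originRowCount: Option[Long]) extends Plan {
    def children: Seq[Plan] = Nil
  }
  case class Scan(table: String) extends Plan {
    def children: Seq[Plan] = Nil
  }

  // True if any node in the plan tree satisfies `p`.
  def exists(plan: Plan)(p: Plan => Boolean): Boolean =
    p(plan) || plan.children.exists(c => exists(c)(p))

  // The filtering side qualifies if it contains a selective filter,
  // a LocalRelation, or a LogicalRDD whose row count is known.
  def hasSelectivePredicate(plan: Plan): Boolean = exists(plan) {
    case Filter(selective, _)       => selective
    case _: LocalRelation           => true
    case LogicalRDD(originRowCount) => originRowCount.isDefined
    case _                          => false
  }
}
```

In this model, a LogicalRDD without a known row count is rejected, which mirrors the "DPP should not trigger for LogicalRDD without originStats" test listed above.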
I will supply a GitHub PR.