[PR] [WIP][SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]

via GitHub Mon, 16 Dec 2024 14:27:05 -0800


ahshahid opened a new pull request, #49209:
URL: https://github.com/apache/spark/pull/49209


   …m side join key on non partition columns
   ### What changes were proposed in this pull request?
   Along the lines of Dynamic Partition Pruning (DPP), which helps DataSourceV2 
relations when the joining key is a partition column, the same concept can be 
extended to the case where the joining key is not a partition column. In this 
PR, the keys available in the BroadcastHashJoinExec are pushed down to the 
DataSourceV2 scans in the form of a SortedSet. For non-partition columns, data 
sources such as Iceberg keep max/min stats for columns at the manifest level, 
and formats such as Parquet keep max/min stats at various data-structure 
levels. The pushed-down SortedSet can be used to prune by range both at the 
driver level (manifest files) and at the executor level (while going through 
chunks, row groups, etc. in Parquet).
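
   As a rough illustration of the range pruning described above, here is a 
minimal Scala sketch. It is not code from this PR: `ColumnStats`, 
`BroadcastKeyPruning`, `mayContainKey`, and `prune` are hypothetical names, and 
the join key is assumed to be a `Long` for simplicity.

```scala
import scala.collection.immutable.SortedSet

// Hypothetical stand-in for the min/max column statistics that a manifest
// entry or a Parquet row group exposes; not a Spark or Iceberg class.
final case class ColumnStats(min: Long, max: Long)

object BroadcastKeyPruning {
  // A file or row group can contain a match only if at least one broadcast
  // key lies inside [min, max]. iteratorFrom(min) starts at the smallest
  // key >= min, so one ordered lookup decides the overlap.
  def mayContainKey(keys: SortedSet[Long], stats: ColumnStats): Boolean = {
    val it = keys.iteratorFrom(stats.min)
    it.hasNext && it.next() <= stats.max
  }

  // Keep only the units (files, row groups, ...) whose stats overlap the keys.
  def prune[A](units: Seq[A], keys: SortedSet[Long])(statsOf: A => ColumnStats): Seq[A] =
    units.filter(u => mayContainKey(keys, statsOf(u)))
}
```

   The sorted order is what makes the check cheap: one lookup per file or row 
group replaces a scan over every broadcast key.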
   If the data is stored in a columnar batch format, it is not possible to 
filter out individual rows at the DataSource level, even though the keys are 
available. At the scan level (ColumnarToRowExec), however, it is still possible 
to filter out as many rows as possible when the query involves nested joins, 
which reduces the number of rows to join at the higher join levels.
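
   The row-level filtering at the scan can be pictured with the following 
Scala sketch. It is only an illustration under assumptions: `RowLevelKeyFilter` 
and `filterRows` are hypothetical names, and rows are modelled as plain 
`(joinKey, payload)` tuples rather than Spark's internal row type.

```scala
import scala.collection.immutable.SortedSet

object RowLevelKeyFilter {
  // Once a columnar batch has been turned into rows, drop every row whose
  // join key is absent from the broadcast key set, so that joins higher in
  // the plan receive fewer rows.
  def filterRows[K, V](rows: Iterator[(K, V)], keys: SortedSet[K]): Iterator[(K, V)] =
    rows.filter { case (key, _) => keys.contains(key) }
}
```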
   Attaching link to a presentation which outlines the idea: [Broadcast Keys 
pushdown](https://docs.google.com/presentation/d/165Rx7i00TmAKnDJpSQLfrcrW-ShrzPy5/edit?usp=drive_link)
   SPIP : [SPIP-44662](https://issues.apache.org/jira/browse/SPARK-44662)
   
   
   ### Why are the changes needed?
   There is scope for improvement in the performance of Inner and Left Semi 
join queries that use BroadcastHashJoin.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Ran the TPC-DS suite using Iceberg as the data source. Converted many of the 
existing Spark query tests to also run using Iceberg as the data source. Will 
be adding more unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

