[ https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566271#comment-17566271 ]
Victor Delépine commented on SPARK-39753: ----------------------------------------- [~yumwang] I'm not too familiar with build side and Spark SQL internals, unfortunately :). Do you think this change is a good idea? Let me know if there's anything I can do to help move this forward! > Broadcast joins should pushdown join constraints as Filter to the larger > relation > --------------------------------------------------------------------------------- > > Key: SPARK-39753 > URL: https://issues.apache.org/jira/browse/SPARK-39753 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.2.0, 3.2.1, 3.3.0 > Reporter: Victor Delépine > Priority: Major > > SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to > re-open it here for more visibility, since I believe this bug has a major > impact and that fixing it could drastically improve the performance of many > pipelines. > Allow me to paste the initial description again here: > _For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{{}lhs.a == rhs.a{}}}, can be written as an {{a in ...}} > clause. An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via_ [~sameerag]{_}'s work on SPARK-12957 subtasks.{_} > _This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results._ > > Essentially, when doing a Broadcast join, the smaller side can be used to > filter down the bigger side before performing the join. As of today, the join > will read all partitions of the bigger side, without pruning partitions -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org