[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471311#comment-16471311 ]
Henry Robinson commented on SPARK-11150:
----------------------------------------

The title of this JIRA is 'dynamic partition pruning', but the examples are a) not related to dynamic partition pruning and b) work as expected in Spark 2.3. Given {{t1.foo = t2.bar AND t2.bar = 1}}, Spark will correctly infer that {{t1.foo = 1}}. It will prune partitions statically - at compile time - and that is reflected in the scan.

_Dynamic_ partition pruning is about pruning partitions based on information that can only be inferred at run time. A typical example is: {{SELECT * FROM dim_table JOIN fact_table ON (dim_table.partcol = fact_table.partcol) WHERE dim_table.othercol > 10}}. Little can be inferred from the query at compilation time about which partitions to scan in {{fact_table}} (except that only the intersection of {{fact_table}}'s and {{dim_table}}'s partitions should be scanned). However, at run time, the set of partition keys produced by scanning {{dim_table}} with the filter predicate can be recorded - usually at the join node - and sent to the probe side of the join (in this case {{fact_table}}). The scan of {{fact_table}} can use that set to filter out any partitions that aren't in the build side of the join, because they wouldn't match any rows during the join.

Hive and Impala both support this kind of partition filtering (and it doesn't only have to apply to partitions - you can filter the rows as well if evaluating the predicate isn't too expensive).
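To make the run-time mechanism concrete, here is a toy sketch in plain Python. It is not Spark code: the in-memory tables, the {{val}} column and the function name {{join_with_runtime_pruning}} are all made up for illustration, and {{fact_table}} is modelled as a dict holding one entry per partition. It only shows the idea of recording the build side's partition keys and using them to skip partitions on the probe side.

```python
# Toy model: dim_table is a list of rows, fact_table maps partition key -> rows.
dim_table = [
    {"partcol": 1, "othercol": 5},
    {"partcol": 2, "othercol": 20},
    {"partcol": 3, "othercol": 30},
]
fact_table = {
    1: [{"partcol": 1, "val": "a"}],
    2: [{"partcol": 2, "val": "b"}, {"partcol": 2, "val": "c"}],
    3: [{"partcol": 3, "val": "d"}],
    4: [{"partcol": 4, "val": "e"}],
}


def join_with_runtime_pruning(dim_rows, fact_partitions, dim_predicate):
    # Build side: scan dim_table with its filter and group surviving rows by key.
    build = {}
    for row in dim_rows:
        if dim_predicate(row):
            build.setdefault(row["partcol"], []).append(row)

    # The keys seen on the build side are only known once that scan has run.
    runtime_keys = set(build)

    # Probe side: skip every fact_table partition whose key is not in the set.
    scanned = []
    joined = []
    for key in sorted(fact_partitions):
        if key not in runtime_keys:
            continue  # pruned: no build-side row could match this partition
        scanned.append(key)
        for fact_row in fact_partitions[key]:
            for dim_row in build[key]:
                joined.append((dim_row, fact_row))
    return scanned, joined


# Equivalent of: ... WHERE dim_table.othercol > 10
scanned, joined = join_with_runtime_pruning(
    dim_table, fact_table, lambda row: row["othercol"] > 10
)
```

With this data only partitions 2 and 3 of the toy {{fact_table}} are read; partitions 1 and 4 are never touched, even though nothing in the query text names them.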
The challenges are:
* making sure that the representation chosen for the filters is compact enough to be shuffled around all the executors that might be performing the scan task, while having a low false-positive rate
* adding the logic to the planner to detect these opportunities
* optionally disabling the filtering if it's not being selective enough
* coordinating amongst the build and probe side to ensure that the latter waits for the former (this is a bit easier in Spark because it's not a pipelined execution model)

Do we agree that this JIRA should be more explicitly made about dynamic partition pruning, or is that tracked elsewhere? If so, I propose closing this one; otherwise I can edit this one's description.

> Dynamic partition pruning
> -------------------------
>
>                 Key: SPARK-11150
>                 URL: https://issues.apache.org/jira/browse/SPARK-11150
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>            Reporter: Younes
>            Priority: Major
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex:
> Select .... from tab where partcol=1 will prune on value 1
> Select .... from tab join dim on (dim.partcol=tab.partcol) where dim.partcol=1 will scan all partitions.
> Tables are based on parquets.