[jira] [Commented] (SPARK-11150) Dynamic partition pruning

Henry Robinson (JIRA) Thu, 10 May 2018 17:11:40 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471311#comment-16471311
 ]


Henry Robinson commented on SPARK-11150:
----------------------------------------

The title of this JIRA is 'dynamic partition pruning', but the examples are a) 
not related to dynamic partition pruning and b) work as expected in Spark 2.3.

Spark will correctly infer that, given {{t1.foo = t2.bar AND t2.bar = 1}}, that 
{{t1.foo = 1}}. It will prune partitions statically - at compile time - and 
that is reflected in the scan.

_Dynamic_ partition pruning is about pruning partitions based on information 
that can only be inferred at run time. A typical example is:

{{SELECT * FROM dim_table JOIN fact_table ON (dim_table.partcol = 
fact_table.partcol) WHERE dim_table.othercol > 10}}.

Little can be inferred from the query at compilation time about what partitions 
to scan in {{fact_table}} (except that only the intersection between 
{{fact_table}} and {{part_table}}'s partitions should be scanned). 

However at run time, the set of partition keys produced by scanning 
{{dim_table}} with the filter predicate can be recorded - usually at the join 
node - and sent to the probe side of the join (in this case {{fact_table}}). 
The scan of {{fact_table}} can use that set to filter out any partitions that 
aren't in the build side of the join, because they wouldn't match any rows 
during the join. Hive and Impala both support this kind of partition filtering 
(and it doesn't only have to apply to partitions - you can filter the rows as 
well if evaluating the predicate isn't too expensive).

The challenges are:

* making sure that the representation chosen for the filters is compact enough 
to be shuffled around all the executors that might be performing the scan task, 
while having a low false-positive rate
* adding the logic to the planner to detect these opportunities
* optionally disabling the filtering if it's not being selective enough
* coordinating amongst the build and probe side to ensure that the latter waits 
for the former (this is a bit easier in Spark because it's not a pipelined 
execution model)

Do we agree that this JIRA should be more explicitly made about dynamic 
partition pruning, or is that tracked elsewhere? If so, I propose closing this 
one; otherwise I can edit this one's description.



> Dynamic partition pruning
> -------------------------
>
>                 Key: SPARK-11150
>                 URL: https://issues.apache.org/jira/browse/SPARK-11150
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>            Reporter: Younes
>            Priority: Major
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select .... from tab where partcol=1 will prune on value 1
> Select .... from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11150) Dynamic partition pruning

Reply via email to