I just looked at the PR. I think there is some follow-up work that needs to be done, e.g. we shouldn't create a top-level package org.apache.spark.sql.dynamicpruning.
On Wed, Oct 02, 2019 at 1:52 PM, Maryann Xue < maryann....@databricks.com > wrote:

> There is no internal write up, but I think we should at least give some
> up-to-date description on that JIRA entry.
>
> On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin < r...@databricks.com > wrote:
>
>> No, there is no separate write up internally.
>>
>> On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue < rb...@netflix.com > wrote:
>>
>>> Thanks for the pointers, but what I'm looking for is information about
>>> the design of this implementation, like what requires this to be in
>>> spark-sql instead of spark-catalyst.
>>>
>>> Even a high-level description, like what the optimizer rules are and
>>> what they do, would be great. Was there one written up internally that
>>> you could share?
>>>
>>> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue < maryann....@databricks.com > wrote:
>>>
>>>> > It lists 3 cases for how a filter is built, but nothing about the
>>>> > overall approach or design that helps when trying to find out where
>>>> > it should be placed in the optimizer rules.
>>>>
>>>> The overall idea/design of DPP can be simply put as using the result
>>>> of one side of the join to prune partitions of a scan on the other
>>>> side. The optimal situation is when the join is a broadcast join and
>>>> the table being partition-pruned is on the probe side. In that case,
>>>> by the time the probe side starts, the filter will already have the
>>>> results available and ready for reuse.
>>>>
>>>> Regarding the place in the optimizer rules, it's preferred to happen
>>>> late in the optimization, and definitely after join reorder.
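The pruning idea Maryann describes can be sketched in plain Python. This is a toy illustration of the concept only, not Spark's actual implementation; the table layout, function name, and data below are all made up:

```python
# Toy sketch of dynamic partition pruning: use the result of the build
# (dimension) side of a broadcast join to prune partitions of the probe
# (fact) side before scanning them.

# Hypothetical partitioned fact table: partition key -> rows.
fact_partitions = {
    "2019-10-01": [("2019-10-01", "a", 1), ("2019-10-01", "b", 2)],
    "2019-10-02": [("2019-10-02", "a", 3)],
    "2019-10-03": [("2019-10-03", "c", 4)],
}

# Dimension table with a selective filter, as in:
#   SELECT ... FROM fact JOIN dim ON fact.date = dim.date WHERE dim.active
dim = [("2019-10-01", True), ("2019-10-02", False), ("2019-10-03", True)]

def scan_with_dynamic_pruning(fact_partitions, dim):
    # The build side runs first (it is broadcast), so its join keys are
    # available before the probe-side scan starts.
    build_keys = {date for date, active in dim if active}
    # Prune: only scan fact partitions whose key appears on the build side.
    return [row
            for key, rows in fact_partitions.items()
            if key in build_keys
            for row in rows]

rows = scan_with_dynamic_pruning(fact_partitions, dim)
# The "2019-10-02" partition is never scanned.
```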
>>>> Thanks,
>>>> Maryann
>>>>
>>>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin < r...@databricks.com > wrote:
>>>>
>>>>> Whoever created the JIRA years ago didn't describe dpp correctly, but
>>>>> the linked jira in Hive was correct (which unfortunately is much more
>>>>> terse than any of the patches we have in Spark:
>>>>> https://issues.apache.org/jira/browse/HIVE-9152 ). Henry R's
>>>>> description was also correct.
>>>>>
>>>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue < rb...@netflix.com.invalid > wrote:
>>>>>
>>>>>> Where can I find a design doc for dynamic partition pruning that
>>>>>> explains how it works?
>>>>>>
>>>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic
>>>>>> partition pruning (as pointed out by Henry R.) and doesn't have any
>>>>>> comments about the implementation's approach. And the PR description
>>>>>> also doesn't have much information. It lists 3 cases for how a
>>>>>> filter is built, but nothing about the overall approach or design
>>>>>> that helps when trying to find out where it should be placed in the
>>>>>> optimizer rules. It also isn't clear why this couldn't be part of
>>>>>> spark-catalyst.
>>>>>>
>>>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>>>>
>>>>>>> The dynamic partition pruning rule generates "hidden" filters that
>>>>>>> will be converted to real predicates at runtime, so it doesn't
>>>>>>> matter where we run the rule.
>>>>>>>
>>>>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's
>>>>>>> better to run it before join reorder.
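Wenchen's point about "hidden" filters can be sketched as follows. This is a minimal plain-Python illustration of the mechanism, not Spark's real classes; `Placeholder` and `materialize` are made-up names:

```python
# Toy sketch: the optimizer plants a placeholder ("hidden") filter whose
# actual predicate is only materialized at runtime, once the broadcast
# side has produced its join keys.

class Placeholder:
    """Hidden filter: carries no usable predicate at optimization time."""
    def __init__(self, column):
        self.column = column
        self.predicate = None  # unknown until runtime

    def materialize(self, build_side_keys):
        # At runtime, turn the placeholder into a real IN-set predicate.
        keys = set(build_side_keys)
        self.predicate = lambda row: row[self.column] in keys

# Optimization time: the rule only inserts the placeholder. Since there is
# no real predicate yet for other rewrites to interact with, where the
# rule runs among the optimizer batches matters little.
hidden = Placeholder(column=0)
assert hidden.predicate is None

# Runtime: the broadcast side finishes and its keys become available.
hidden.materialize(["2019-10-01", "2019-10-03"])
rows = [("2019-10-01", 1), ("2019-10-02", 2), ("2019-10-03", 3)]
pruned = [r for r in rows if hidden.predicate(r)]
```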
>>>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue < rb...@netflix.com.invalid > wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I have been working on a PR that moves filter and projection
>>>>>>>> pushdown into the optimizer for DSv2, instead of when converting
>>>>>>>> to physical plan. This will make DSv2 work with optimizer rules
>>>>>>>> that depend on stats, like join reordering.
>>>>>>>>
>>>>>>>> While adding the optimizer rule, I found that some rules appear to
>>>>>>>> be out of order. For example, PruneFileSourcePartitions, which
>>>>>>>> handles filter pushdown for v1 scans, is in SparkOptimizer
>>>>>>>> (spark-sql) in a batch that will run after all of the batches in
>>>>>>>> Optimizer (spark-catalyst), including CostBasedJoinReorder.
>>>>>>>>
>>>>>>>> SparkOptimizer also adds the new "dynamic partition pruning" rules
>>>>>>>> after both the cost-based join reordering and the v1 partition
>>>>>>>> pruning rule. I'm not sure why this should run after join
>>>>>>>> reordering and partition pruning, since it seems to me like
>>>>>>>> additional filters would be good to have before those rules run.
>>>>>>>>
>>>>>>>> It looks like this might just be that the rules were written in
>>>>>>>> the spark-sql module instead of in catalyst. That makes some sense
>>>>>>>> for the v1 pushdown, which is altering physical plan details
>>>>>>>> (FileIndex) that have leaked into the logical plan. I'm not sure
>>>>>>>> why the dynamic partition pruning rules aren't in catalyst or why
>>>>>>>> they run after the v1 predicate pushdown.
>>>>>>>>
>>>>>>>> Can someone more familiar with these rules clarify why they appear
>>>>>>>> to be out of order?
>>>>>>>> Assuming that this is an accident, I think it's something that
>>>>>>>> should be fixed before 3.0. My PR fixes early pushdown, but the
>>>>>>>> "dynamic" pruning may still need to be addressed.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
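Ryan's ordering concern can be illustrated with a toy sketch: a stats-based rule makes a different choice depending on whether partition pruning has already shrunk a scan. The rule names, relations, and row counts below are all invented for illustration and do not reflect Spark's actual cost model:

```python
# Toy sketch of why rule order matters: a cost-based join rule (here, one
# that broadcasts the smaller relation) decides differently depending on
# whether pruning ran first and shrank the fact scan.

def prune(rel):
    # Stand-in for PruneFileSourcePartitions: pruning shrinks the fact scan.
    if rel["name"] == "fact":
        return {**rel, "rows": rel["rows"] // 100}
    return rel

def reorder(left, right):
    # Stand-in for a cost-based rule: put the smaller side on the build side.
    return (left, right) if left["rows"] <= right["rows"] else (right, left)

fact = {"name": "fact", "rows": 1_000_000}
dim = {"name": "dim", "rows": 50_000}

# Order A: reorder first; the reorder rule never sees the pruned size.
build_a, _ = reorder(fact, dim)
# Order B: prune first; the reorder rule sees the real (smaller) fact size
# and makes a different choice.
build_b, _ = reorder(prune(fact), dim)
```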