No there is no separate write up internally. On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue <rb...@netflix.com> wrote:
> Thanks for the pointers, but what I'm looking for is information about the > design of this implementation, like what requires this to be in spark-sql > instead of spark-catalyst. > > Even a high-level description, like what the optimizer rules are and what > they do would be great. Was there one written up internally that you could > share? > > On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue <maryann....@databricks.com> > wrote: > >> > It lists 3 cases for how a filter is built, but nothing about the >> overall approach or design that helps when trying to find out where it >> should be placed in the optimizer rules. >> >> The overall idea/design of DPP can be simply put as using the result of >> one side of the join to prune partitions of a scan on the other side. The >> optimal situation is when the join is a broadcast join and the table being >> partition-pruned is on the probe side. In that case, by the time the probe >> side starts, the filter will already have the results available and ready >> for reuse. >> >> Regarding the place in the optimizer rules, it's preferred to happen late >> in the optimization, and definitely after join reorder. >> >> >> Thanks, >> Maryann >> >> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin <r...@databricks.com> wrote: >> >>> Whoever created the JIRA years ago didn't describe dpp correctly, but >>> the linked jira in Hive was correct (which unfortunately is much more terse >>> than any of the patches we have in Spark >>> https://issues.apache.org/jira/browse/HIVE-9152). Henry R's description >>> was also correct. >>> >>> >>> >>> >>> >>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue <rb...@netflix.com.invalid> >>> wrote: >>> >>>> Where can I find a design doc for dynamic partition pruning that >>>> explains how it works? >>>> >>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition >>>> pruning (as pointed out by Henry R.) and doesn't have any comments about >>>> the implementation's approach. And the PR description also doesn't have >>>> much information. It lists 3 cases for how a filter is built, but >>>> nothing about the overall approach or design that helps when trying to find >>>> out where it should be placed in the optimizer rules. It also isn't clear >>>> why this couldn't be part of spark-catalyst. >>>> >>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan <cloud0...@gmail.com> wrote: >>>> >>>>> dynamic partition pruning rule generates "hidden" filters that will be >>>>> converted to real predicates at runtime, so it doesn't matter where we run >>>>> the rule. >>>>> >>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's >>>>> better to run it before join reorder. >>>>> >>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue <rb...@netflix.com.invalid> >>>>> wrote: >>>>> >>>>>> Hi everyone, >>>>>> >>>>>> I have been working on a PR that moves filter and projection pushdown >>>>>> into the optimizer for DSv2, instead of when converting to physical plan. >>>>>> This will make DSv2 work with optimizer rules that depend on stats, like >>>>>> join reordering. >>>>>> >>>>>> While adding the optimizer rule, I found that some rules appear to be >>>>>> out of order. For example, PruneFileSourcePartitions that handles >>>>>> filter pushdown for v1 scans is in SparkOptimizer (spark-sql) in a >>>>>> batch that will run after all of the batches in Optimizer >>>>>> (spark-catalyst) including CostBasedJoinReorder. >>>>>> >>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules >>>>>> *after* both the cost-based join reordering and the v1 partition >>>>>> pruning rule. I’m not sure why this should run after join reordering and >>>>>> partition pruning, since it seems to me like additional filters would be >>>>>> good to have before those rules run. >>>>>> >>>>>> It looks like this might just be that the rules were written in the >>>>>> spark-sql module instead of in catalyst. That makes some sense for the v1 >>>>>> pushdown, which is altering physical plan details (FileIndex) that >>>>>> have leaked into the logical plan. I’m not sure why the dynamic partition >>>>>> pruning rules aren’t in catalyst or why they run after the v1 predicate >>>>>> pushdown. >>>>>> >>>>>> Can someone more familiar with these rules clarify why they appear to >>>>>> be out of order? >>>>>> >>>>>> Assuming that this is an accident, I think it’s something that should >>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic” >>>>>> pruning >>>>>> may still need to be addressed. >>>>>> >>>>>> rb >>>>>> -- >>>>>> Ryan Blue >>>>>> Software Engineer >>>>>> Netflix >>>>>> >>>>> >>>> >>>> -- >>>> Ryan Blue >>>> Software Engineer >>>> Netflix >>>> >>> >>> > > -- > Ryan Blue > Software Engineer > Netflix >