> It lists 3 cases for how a filter is built, but nothing about the overall
> approach or design, which would help in working out where the rule should
> be placed in the optimizer.

The overall idea of DPP can be put simply: use the result of one side of a
join to prune partitions from the scan on the other side. The optimal
situation is when the join is a broadcast join and the table being
partition-pruned is on the probe side. In that case, by the time the probe
side starts, the filter results are already available and ready for reuse.
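
As a concrete example (a minimal sketch; the table and column names are
hypothetical, and the config key is, to my knowledge, the flag that gates
the feature in Spark 3.0):

  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // `sales` is a large fact table partitioned by `date`; `dates` is a small
  // dimension table that gets broadcast. The filter on `dates` is reused at
  // runtime to prune `sales` partitions before the probe side is scanned.
  spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  spark.sql("""
    SELECT s.amount
    FROM sales s
    JOIN dates d ON s.date = d.date
    WHERE d.day_name = 'Mon'
  """).explain()  // the sales scan should show a dynamic pruning subquery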

Regarding its place in the optimizer rules, it is preferred to run late in
the optimization, and definitely after join reordering.


Thanks,
Maryann

On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin <r...@databricks.com> wrote:

> Whoever created the JIRA years ago didn't describe DPP correctly, but the
> linked Hive JIRA (https://issues.apache.org/jira/browse/HIVE-9152) is
> correct, though it is unfortunately much more terse than any of the
> patches we have in Spark. Henry R's description was also correct.
>
> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Where can I find a design doc for dynamic partition pruning that explains
>> how it works?
>>
>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>> the implementation's approach. And the PR description also doesn't have
>> much information. It lists 3 cases for how a filter is built, but nothing
>> about the overall approach or design, which would help in working out
>> where the rule should be placed in the optimizer. It also isn't clear why
>> this couldn't be part of spark-catalyst.
>>
>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> The dynamic partition pruning rule generates "hidden" filters that are
>>> converted to real predicates at runtime, so it doesn't matter where we
>>> run the rule.
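>>>
>>> To illustrate the idea (a rough sketch with made-up names, not Spark's
>>> actual classes): the hidden filter is a placeholder whose value set only
>>> becomes known once the build side of the join has run.
>>>
>>>   // Placeholder predicate inserted by the optimizer rule. The build-side
>>>   // values are unavailable at planning time, so the scan keeps a thunk
>>>   // and resolves it into a concrete IN-set predicate at execution time.
>>>   case class DynamicPruning(column: String, buildSide: () => Set[Any]) {
>>>     def toRuntimePredicate: (String, Set[Any]) = (column, buildSide())
>>>   }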
>>>
>>> For PruneFileSourcePartitions, I'm not quite sure. It seems to me that
>>> it's better to run it before join reordering.
>>>
>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I have been working on a PR that moves filter and projection pushdown
>>>> into the optimizer for DSv2, instead of doing it during conversion to
>>>> the physical plan. This will make DSv2 work with optimizer rules that
>>>> depend on stats, like join reordering.
>>>>
>>>> While adding the optimizer rule, I found that some rules appear to be
>>>> out of order. For example, PruneFileSourcePartitions, which handles
>>>> filter pushdown for v1 scans, is in SparkOptimizer (spark-sql) in a
>>>> batch that runs after all of the batches in Optimizer (spark-catalyst),
>>>> including CostBasedJoinReorder.
>>>>
>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>> *after* both the cost-based join reordering and the v1 partition pruning
>>>> rule. I’m not sure why they should run after join reordering and
>>>> partition pruning, since it seems to me that the additional filters
>>>> would be useful before those rules run.
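>>>>
>>>> For reference, that ordering falls out of how SparkOptimizer appends its
>>>> batches after the ones inherited from Optimizer. A self-contained
>>>> mini-model of the structure (illustrative only, not the actual classes):
>>>>
>>>>   case class Batch(name: String, rules: String*)
>>>>
>>>>   class Optimizer {                          // spark-catalyst
>>>>     def defaultBatches: Seq[Batch] = Seq(
>>>>       Batch("Operator Optimization", "..."), // many catalyst batches
>>>>       Batch("Join Reorder", "CostBasedJoinReorder"))
>>>>   }
>>>>
>>>>   class SparkOptimizer extends Optimizer {   // spark-sql
>>>>     override def defaultBatches: Seq[Batch] =
>>>>       super.defaultBatches ++ Seq(           // appended, so they run last
>>>>         Batch("Prune File Source Table Partitions",
>>>>           "PruneFileSourcePartitions"),
>>>>         Batch("PartitionPruning", "PartitionPruning")) // the DPP rules
>>>>   }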
>>>>
>>>> It looks like this may simply be because the rules were written in the
>>>> spark-sql module instead of in catalyst. That makes some sense for the
>>>> v1 pushdown, which alters physical plan details (FileIndex) that have
>>>> leaked into the logical plan. I’m not sure why the dynamic partition
>>>> pruning rules aren’t in catalyst, or why they run after the v1 predicate
>>>> pushdown.
>>>>
>>>> Can someone more familiar with these rules clarify why they appear to
>>>> be out of order?
>>>>
>>>> Assuming that this is an accident, I think it’s something that should
>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning
>>>> may still need to be addressed.
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
