I just looked at the PR. I think there is some follow-up work that needs to
be done, e.g. we shouldn't create a top-level package
org.apache.spark.sql.dynamicpruning.

On Wed, Oct 02, 2019 at 1:52 PM, Maryann Xue < maryann....@databricks.com > 
wrote:

> 
> There is no internal write-up, but I think we should at least put an
> up-to-date description on that JIRA entry.
> 
> On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> No, there is no separate write-up internally.
>> 
>> On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue < rb...@netflix.com > wrote:
>> 
>> 
>>> Thanks for the pointers, but what I'm looking for is information about the
>>> design of this implementation, like what requires this to be in spark-sql
>>> instead of spark-catalyst.
>>> 
>>> 
>>> Even a high-level description, like what the optimizer rules are and what
>>> they do, would be great. Was there one written up internally that you could
>>> share?
>>> 
>>> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue < maryann....@databricks.com >
>>> wrote:
>>> 
>>> 
>>>> > It lists 3 cases for how a filter is built, but nothing about the
>>>> overall approach or design that helps when trying to find out where it
>>>> should be placed in the optimizer rules.
>>>> 
>>>> 
>>>> The overall idea/design of DPP is simple: use the result of one side of a
>>>> join to prune partitions of the scan on the other side. The optimal
>>>> situation is when the join is a broadcast join and the table being
>>>> partition-pruned is on the probe side. In that case, by the time the probe
>>>> side starts, the filter will already have its results available and ready
>>>> for reuse.
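>>>> 
>>>> To make that concrete, here is the kind of query shape that benefits.
>>>> This is only an illustration: the table and column names are made up,
>>>> and "spark" is an ordinary SparkSession.
>>>> 
>>>>   // "sales" is a large table partitioned by "day"; "dates" is a small
>>>>   // dimension table that gets broadcast. The filter on "dates" is
>>>>   // evaluated on the build side first, and its result is reused as a
>>>>   // partition filter on "sales", so only the matching partitions are
>>>>   // scanned instead of the whole table.
>>>>   spark.sql("""
>>>>     SELECT s.amount
>>>>     FROM sales s
>>>>     JOIN dates d ON s.day = d.day
>>>>     WHERE d.is_holiday = true
>>>>   """)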
>>>> 
>>>> 
>>>> Regarding its place in the optimizer rules, it should run late in the
>>>> optimization, and definitely after join reorder.
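>>>> 
>>>> As a sketch of what "late" means in terms of batch ordering (simplified,
>>>> not the literal source; the subclass name is invented, and the rule name
>>>> is the one from the PR):
>>>> 
>>>>   import org.apache.spark.sql.catalyst.optimizer.Optimizer
>>>>   import org.apache.spark.sql.connector.catalog.CatalogManager
>>>>   import org.apache.spark.sql.dynamicpruning.PartitionPruning
>>>> 
>>>>   // SparkOptimizer extends catalyst's Optimizer; a batch appended
>>>>   // after super.defaultBatches runs after CostBasedJoinReorder and
>>>>   // every other catalyst batch.
>>>>   class MyOptimizer(catalogManager: CatalogManager)
>>>>       extends Optimizer(catalogManager) {
>>>>     override def defaultBatches: Seq[Batch] =
>>>>       super.defaultBatches :+
>>>>         Batch("PartitionPruning", Once, PartitionPruning)
>>>>   }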
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Maryann
>>>> 
>>>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin < r...@databricks.com > wrote:
>>>> 
>>>> 
>>>> 
>>>>> Whoever created the JIRA years ago didn't describe DPP correctly, but the
>>>>> linked JIRA in Hive (https://issues.apache.org/jira/browse/HIVE-9152) is
>>>>> correct, though unfortunately it is much more terse than any of the
>>>>> patches we have in Spark. Henry R's description was also correct.
>>>>> 
>>>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue < rb...@netflix.com.invalid > 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>>> Where can I find a design doc for dynamic partition pruning that explains
>>>>>> how it works?
>>>>>> 
>>>>>> 
>>>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>>>> the implementation's approach. The PR description also doesn't have much
>>>>>> information: it lists 3 cases for how a filter is built, but nothing
>>>>>> about the overall approach or design that would help when trying to work
>>>>>> out where it should be placed in the optimizer rules. It also isn't
>>>>>> clear why this couldn't be part of spark-catalyst.
>>>>>> 
>>>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>>>> 
>>>>>> 
>>>>>>> The dynamic partition pruning rule generates "hidden" filters that are
>>>>>>> converted to real predicates at runtime, so it doesn't matter where we
>>>>>>> run the rule.
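>>>>>>> 
>>>>>>> A minimal sketch of what such a "hidden" filter could look like, as I
>>>>>>> understand it (the placeholder class below is made up for
>>>>>>> illustration, not the actual Spark expression):
>>>>>>> 
>>>>>>>   import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}
>>>>>>>   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>>>>>>>   import org.apache.spark.sql.types.{BooleanType, DataType}
>>>>>>> 
>>>>>>>   // Hypothetical placeholder predicate attached to the scan side.
>>>>>>>   // It cannot be evaluated directly; physical planning rewrites it
>>>>>>>   // into a real IN-subquery filter, ideally reusing the broadcast
>>>>>>>   // result of the join's build side.
>>>>>>>   case class DynamicPruningPlaceholder(
>>>>>>>       pruningKey: Expression,   // partition column on the scan side
>>>>>>>       buildKey: Expression,     // join key on the other side
>>>>>>>       buildQuery: LogicalPlan)  // the other side of the join
>>>>>>>     extends Expression with Unevaluable {
>>>>>>>     override def children: Seq[Expression] = Seq(pruningKey, buildKey)
>>>>>>>     override def dataType: DataType = BooleanType
>>>>>>>     override def nullable: Boolean = false
>>>>>>>   }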
>>>>>>> 
>>>>>>> 
>>>>>>> For PruneFileSourcePartitions, I'm not quite sure. It seems to me that
>>>>>>> it would be better to run it before join reorder.
>>>>>>> 
>>>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue < rb...@netflix.com.invalid >
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi everyone,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I have been working on a PR that moves filter and projection pushdown
>>>>>>>> for DSv2 into the optimizer, instead of doing it when converting to the
>>>>>>>> physical plan. This will make DSv2 work with optimizer rules that
>>>>>>>> depend on stats, like join reordering.
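>>>>>>>> 
>>>>>>>> For context, the rough shape of the rule is something like this
>>>>>>>> (simplified sketch; supportsPushDown and withPushedFilters are
>>>>>>>> stand-ins for the real v2 pushdown plumbing, not actual methods):
>>>>>>>> 
>>>>>>>>   import org.apache.spark.sql.catalyst.expressions.{And, PredicateHelper}
>>>>>>>>   import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
>>>>>>>>   import org.apache.spark.sql.catalyst.rules.Rule
>>>>>>>>   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
>>>>>>>> 
>>>>>>>>   object V2ScanPushdown extends Rule[LogicalPlan] with PredicateHelper {
>>>>>>>>     override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
>>>>>>>>       case Filter(condition, relation: DataSourceV2Relation) =>
>>>>>>>>         // split the condition and hand the supported predicates to
>>>>>>>>         // the scan; keep the rest as a filter above the scan
>>>>>>>>         val (pushed, postScan) = splitConjunctivePredicates(condition)
>>>>>>>>           .partition(relation.supportsPushDown)     // hypothetical
>>>>>>>>         val scan = relation.withPushedFilters(pushed) // hypothetical
>>>>>>>>         postScan.reduceOption(And).map(Filter(_, scan)).getOrElse(scan)
>>>>>>>>     }
>>>>>>>>   }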
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> While adding the optimizer rule, I found that some rules appear to be
>>>>>>>> out of order. For example, PruneFileSourcePartitions, which handles
>>>>>>>> filter pushdown for v1 scans, is in SparkOptimizer (spark-sql) in a
>>>>>>>> batch that will run after all of the batches in Optimizer
>>>>>>>> (spark-catalyst), including CostBasedJoinReorder.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>>>>> after both the cost-based join reordering and the v1 partition pruning
>>>>>>>> rule. I’m not sure why this should run after join reordering and
>>>>>>>> partition pruning, since it seems to me like additional filters would
>>>>>>>> be good to have before those rules run.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> It looks like this might just be that the rules were written in the
>>>>>>>> spark-sql module instead of in catalyst. That makes some sense for the
>>>>>>>> v1 pushdown, which is altering physical plan details (FileIndex) that
>>>>>>>> have leaked into the logical plan. I’m not sure why the dynamic
>>>>>>>> partition pruning rules aren’t in catalyst or why they run after the
>>>>>>>> v1 predicate pushdown.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Can someone more familiar with these rules clarify why they appear to
>>>>>>>> be out of order?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Assuming that this is an accident, I think it’s something that should
>>>>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic”
>>>>>>>> pruning may still need to be addressed.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> rb
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
> 
>
