[ 
https://issues.apache.org/jira/browse/DRILL-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15863037#comment-15863037
 ] 

ASF GitHub Bot commented on DRILL-5254:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/746

    Drill 5254: Enhance default reduction factors in optimizer

    Provides enhanced default selectivity rules as defined in DRILL-5254. 
Please see the JIRA entry for the details.
    
    The rules are selectable with a boot-time option and are disabled by 
default to allow full testing of the rules before they become enabled by 
default.
    
    Also provides a complete unit test. The test required enhancements to the 
new test framework; those enhancements are also included.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5254

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #746
    
----
commit 62edc37fcb4034fed6f502b34e20923b7b6e5e75
Author: Paul Rogers <[email protected]>
Date:   2017-02-12T23:25:02Z

    Feature implementation
    
    Provides enhanced default selectivity rules as defined in DRILL-5254.
    The rules are selectable with a boot-time option and are disabled by
    default to allow full testing of the rules before they become enabled
    for all.

commit 66c59a2fc535e668599bb40765074ee7d6000aa9
Author: Paul Rogers <[email protected]>
Date:   2017-02-12T23:31:37Z

    Unit tests and supporting code
    
    Provides unit tests for the feature.
    
    Required adding a boolean field type to the mock data source.
    
    This test relies on saving query profiles and analyzing them to detect
    selectivity. However, the CTTAS feature introduced a check that causes
    tests to fail if they save query profiles for analysis. Provides a
    workaround in the test framework to delete the locally stored files
    before each test that will preserve local state.
    
    Added comments to several test methods.

----


> Enhance default reduction factors in optimizer
> ----------------------------------------------
>
>                 Key: DRILL-5254
>                 URL: https://issues.apache.org/jira/browse/DRILL-5254
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10
>
>
> Drill uses Calcite for query parsing and optimization. Drill uses Calcite's 
> default selectivity (reduction factor) rules to compute the number of rows 
> removed by a filter.
> The default rules appear to be overly aggressive in estimating reductions. In 
> a production use case, an input with 4 billion rows was estimated to return 
> just 40K rows from a filter. That is, the filter estimated a 1/1,000,000 
> reduction in rows. As it turns out, the actual reduction was closer to 1/2.
> The result was that the planner compared the expected 40K rows against 
> another input of 2.5 million rows, and decided the 40K rows would be best on 
> the build side of a hash join. When confronted with the actual 3 billion 
> rows, the hash join ran out of memory.
> The moral of the story is that, in Drill, it is worth being conservative when 
> planning for memory-intensive operations.
> The (sanitized) filter is the following, annotated with (a guess at) the 
> default reduction factors in each term:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
>                      'Value5','Value6','Value7','Value8','Value9') -- 25%
> AND col2_i <=3 -- 25%
> AND col3_s1 = 'Y' -- 15%
> AND col4_s1 = 'Y' -- 15%
> AND col5_s6 not like '%str1%' -- 25%
> AND col5_s6 not like '%str2%' -- 25%
> AND col5_s6 not like '%str3%' -- 25%
> AND col5_s6 not like '%str4%' -- 25%
> {code}
> Total reduction is something like:
> {code}
> .25 * .25 * .15 ^ 2 * .25 ^ 4 = 0.000005
> {code}
> Filter estimation is a known hard problem. In general, one needs statistics 
> and other data, and even then the estimates are just guesses.
> Still it is possible to ensure that the defaults are at least unbiased. That 
> is if we assume that the probability of A LIKE B being 25%, then the 
> probability of A NOT LIKE B should be 75%, not also 25%.
> This JIRA suggests creating an experimental set of defaults based on the 
> "core" Calcite defaults, but with other reduction factors derived using the 
> laws of probability. In particular:
> || Operator || Revised || Explanation || Calcite Default
> | = | 0.15 | Default in Calcite | 0.15
> | <> | 0.85 | 1 - p(=) | 0.5
> | < | 0.425 | p(<>) / 2 | 0.5
> | > | 0.425 | p(<>) / 2 | 0.5
> | <= | 0.575 | p(<) + p(=) | 0.5
> | >= | 0.575 | p(>) + p(=) | 0.5
> | LIKE | 0.25 | Default in Calcite | 0.25
> | NOT LIKE | 0.75 | 1 - p(LIKE) | 0.25
> | NOT NULL | 0.90 | Default in Calcite | 0.90
> | IS NULL | 0.10 | 1 - p(NOT NULL) | 0.25
> | IS TRUE | 0.5 | 1 / 2 | 0.25
> | IS FALSE | 0.5 | 1 / 2 | 0.25
> | IS NOT TRUE | 0.55 | 1 - p(IS TRUE) - p(IS NULL) | 0.25
> | IS NOT FALSE | 0.55 | 1 - p(IS FALSE) - p(IS NULL) | 0.25
> | A OR B | Varies | min(p(A) + p(B) - p(A ^ B), 0.5) | 0.5
> | A AND B | Varies | p(A ^ B) = p(A) * p(B) | Same
> | IN (a) | 0.15 | p(=) | 0.25
> | x IN (a, b, c, ...) | Varies | p(x = a v x = b v x = c v ...) | 0.25
> | NOT A | Varies | 1 - p(A) | 0.25
> | BETWEEN a AND b | 0.33 | p(<= ^ >=) | 0.25
> | NOT BETWEEN a AND b | 0.72 | 1 - p(BETWEEN) | 0.25
> The Calcite defaults were identified by inspection and verified by tests. The 
> Calcite rules make sense if one considers conditional probability: that the 
> user applied a particular expression to the data with the expectation that 
> given that data set, the expression matches 25% of the rows.
> The probability of the IS NOT TRUE statement assumes the presence of nulls, 
> while IS TRUE does not. The rule for OR caps the reduction factor at 0.5 per 
> standard practice. The rule for BETWEEN arises because Calcite (or Drill?) 
> rewrites {{a BETWEEN b AND c}} as {{a >= b AND a <= c}}.
> With the revised rules, the example WHERE reduction becomes:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
>                      'Value5','Value6','Value7','Value8','Value9') -- 50%
> AND col2_i <=3 -- 57%
> AND col3_s1 = 'Y' -- 15%
> AND col4_s1 = 'Y' -- 15%
> AND col5_s6 not like '%str1%' -- 85%
> AND col5_s6 not like '%str2%' -- 85%
> AND col5_s6 not like '%str3%' -- 85%
> AND col5_s6 not like '%str4%' -- 85%
> .5 * .57 * .15^2 * .85^4 = 0.003
> {code}
> The new rules are not a panacea: they are still just guesses. However, they 
> are unbiased guesses based on the rules of probability which result in more 
> conservative reductions of filters. The result may be better plans in queries 
> with large conjunctions (large number of expressions AND'ed together.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to