[jira] [Comment Edited] (DRILL-5254) Enhance default reduction factors in optimizer

Paul Rogers (JIRA) Fri, 10 Feb 2017 18:42:07 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862171#comment-15862171
 ]


Paul Rogers edited comment on DRILL-5254 at 2/11/17 2:41 AM:
-------------------------------------------------------------

The enhanced rules are enabled (and customized!) using boot-time config options.

{code}
  optimizer: {
    enhanced_defaults: {
      // If true, uses the enhanced Drill reduction factor rules.
      // If false, uses the default Calcite rules.
      enable: false,
      // p(a = value)
      // Probability that a column equals some value.
      // Default is 0.15 in Calcite, 0.10 in most textbooks (and System-R)
      prob_eq: 0.15,
      // p(a IS NULL)
      // Calcite defines p(a NOT NULL) as 0.9
      prob_null: 0.10,
      // p(a LIKE value)
      // Calcite default is 0.25
      prob_like: 0.25
    }
  },
{code}

The rules are off by default until we do sufficient testing.


was (Author: paul-rogers):
The enhanced rules are enabled (and customized!) using boot-time config options.

{code}
  optimizer: {
    enhanced_defaults: {
      // If true, uses the enhanced Drill reduction factor rules.
      // If false, uses the default Calcite rules.
      enable: false,
      // p(a = value)
      // Probability that a column equals some value.
      // Default is 0.15 in Calcite, 0.10 in most textbooks (and System-R)
      prob_eq: 0.15,
      // p(a IS NULL)
      // Calcite defines p(a NOT NULL) as 0.9
      prob_is_null: 0.10,
      // p(a LIKE value)
      // Calcite default is 0.25
      prob_like: 0.25
    }
  },
{code}

The rules are off by default until we do sufficient testing.

> Enhance default reduction factors in optimizer
> ----------------------------------------------
>
>                 Key: DRILL-5254
>                 URL: https://issues.apache.org/jira/browse/DRILL-5254
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10
>
>
> Drill uses Calcite for query parsing and optimization. Drill uses Calcite's 
> default selectivity (reduction factor) rules to compute the number of rows 
> removed by a filter.
> The default rules appear to be overly aggressive in estimating reductions. In 
> a production use case, an input with 4 billion rows was estimated to return 
> just 40K rows from a filter. That is, the filter estimated a 1/1,000,000 
> reduction in rows. As it turns out, the actual reduction was closer to 1/2.
> The result was that the planner compared the expected 40K rows against 
> another input of 2.5 million rows, and decided the 40K rows would be best on 
> the build side of a hash join. When confronted with the actual 3 billion 
> rows, the hash join ran out of memory.
> The moral of the story is that, in Drill, it is worth being conservative when 
> planning for memory-intensive operations.
> The (sanitized) filter is the following, annotated with (a guess at) the 
> default reduction factors in each term:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
>                      'Value5','Value6','Value7','Value8','Value9') -- 25%
> AND col2_i <=3 -- 25%
> AND col3_s1 = 'Y' -- 15%
> AND col4_s1 = 'Y' -- 15%
> AND col5_s6 not like '%str1%' -- 25%
> AND col5_s6 not like '%str2%' -- 25%
> AND col5_s6 not like '%str3%' -- 25%
> AND col5_s6 not like '%str4%' -- 25%
> {code}
> Total reduction is something like:
> {code}
> .25 * .25 * .15 ^ 2 * .25 ^ 4 = 0.000005
> {code}
> Filter estimation is a known hard problem. In general, one needs statistics 
> and other data, and even then the estimates are just guesses.
> Still it is possible to ensure that the defaults are at least unbiased. That 
> is if we assume that the probability of A LIKE B being 25%, then the 
> probability of A NOT LIKE B should be 75%, not also 25%.
> This JIRA suggests creating an experimental set of defaults based on the 
> "core" Calcite defaults, but with other reduction factors derived using the 
> laws of probability. In particular:
> || Operator || Revised || Explanation || Calcite Default
> | = | 0.15 | Default in Calcite | 0.15
> | <> | 0.85 | 1 - p(=) | 0.5
> | < | 0.425 | p(<>) / 2 | 0.5
> | > | 0.425 | p(<>) / 2 | 0.5
> | <= | 0.575 | p(<) + p(=) | 0.5
> | >= | 0.575 | p(>) + p(=) | 0.5
> | LIKE | 0.25 | Default in Calcite | 0.25
> | NOT LIKE | 0.75 | 1 - p(LIKE) | 0.25
> | NOT NULL | 0.90 | Default in Calcite | 0.90
> | IS NULL | 0.10 | 1 - p(NOT NULL) | 0.25
> | IS TRUE | 0.5 | 1 / 2 | 0.25
> | IS FALSE | 0.5 | 1 / 2 | 0.25
> | IS NOT TRUE | 0.55 | 1 - p(IS TRUE) - p(IS NULL) | .25
> | IS NOT FALSE | 0.55 | 1 - p(IS FALSE) - p(IS NULL) | .25
> | A OR B | Varies | min(p(A) + p(B) - p(A ^ B), 0.5) | 0.5
> | IN (a) | 0.15 | p(=) | 0.5
> | x IN (a, b, c, ...) | Varies | p(x = a v x = b v x = c v ...) | 0.5
> The Calcite defaults should be taken as approximate.
> The probability of the IS NOT TRUE statement assumes the presence of nulls, 
> while IS TRUE does not. The rule for OR caps the reduction factor at 0.5 per 
> standard practice.
> With the revised rules, the example WHERE reduction becomes:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
>                      'Value5','Value6','Value7','Value8','Value9') -- 50%
> AND col2_i <=3 -- 57%
> AND col3_s1 = 'Y' -- 15%
> AND col4_s1 = 'Y' -- 15%
> AND col5_s6 not like '%str1%' -- 85%
> AND col5_s6 not like '%str2%' -- 85%
> AND col5_s6 not like '%str3%' -- 85%
> AND col5_s6 not like '%str4%' -- 85%
> .5 * .57 * .15^2 * .85^4 = 0.003
> {code}
> The new rules are not a panacea: they are still just guesses. However, they 
> are unbiased guesses based on the rules of probability which result in more 
> conservative reductions of filters. The result may be better plans in queries 
> with large conjunctions (large number of expressions AND'ed together.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Comment Edited] (DRILL-5254) Enhance default reduction factors in optimizer

Reply via email to