[
https://issues.apache.org/jira/browse/DRILL-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862171#comment-15862171
]
Paul Rogers edited comment on DRILL-5254 at 2/11/17 2:41 AM:
-------------------------------------------------------------
The enhanced rules are enabled (and customized!) using boot-time config options.
{code}
optimizer: {
enhanced_defaults: {
// If true, uses the enhanced Drill reduction factor rules.
// If false, uses the default Calcite rules.
enable: false,
// p(a = value)
// Probability that a column equals some value.
// Default is 0.15 in Calcite, 0.10 in most textbooks (and System-R)
prob_eq: 0.15,
// p(a IS NULL)
// Calcite defines p(a NOT NULL) as 0.9
prob_null: 0.10,
// p(a LIKE value)
// Calcite default is 0.25
prob_like: 0.25
}
},
{code}
The rules are off by default until we do sufficient testing.
was (Author: paul-rogers):
The enhanced rules are enabled (and customized!) using boot-time config options.
{code}
optimizer: {
enhanced_defaults: {
// If true, uses the enhanced Drill reduction factor rules.
// If false, uses the default Calcite rules.
enable: false,
// p(a = value)
// Probability that a column equals some value.
// Default is 0.15 in Calcite, 0.10 in most textbooks (and System-R)
prob_eq: 0.15,
// p(a IS NULL)
// Calcite defines p(a NOT NULL) as 0.9
prob_is_null: 0.10,
// p(a LIKE value)
// Calcite default is 0.25
prob_like: 0.25
}
},
{code}
The rules are off by default until we do sufficient testing.
> Enhance default reduction factors in optimizer
> ----------------------------------------------
>
> Key: DRILL-5254
> URL: https://issues.apache.org/jira/browse/DRILL-5254
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.10
>
>
> Drill uses Calcite for query parsing and optimization. Drill uses Calcite's
> default selectivity (reduction factor) rules to compute the number of rows
> removed by a filter.
> The default rules appear to be overly aggressive in estimating reductions. In
> a production use case, an input with 4 billion rows was estimated to return
> just 40K rows from a filter. That is, the filter estimated a 1/1,000,000
> reduction in rows. As it turns out, the actual reduction was closer to 1/2.
> The result was that the planner compared the expected 40K rows against
> another input of 2.5 million rows, and decided the 40K rows would be best on
> the build side of a hash join. When confronted with the actual 3 billion
> rows, the hash join ran out of memory.
> The moral of the story is that, in Drill, it is worth being conservative when
> planning for memory-intensive operations.
> The (sanitized) filter is the following, annotated with (a guess at) the
> default reduction factors in each term:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
> 'Value5','Value6','Value7','Value8','Value9') -- 25%
> AND col2_i <=3 -- 25%
> AND col3_s1 = 'Y' -- 15%
> AND col4_s1 = 'Y' -- 15%
> AND col5_s6 not like '%str1%' -- 25%
> AND col5_s6 not like '%str2%' -- 25%
> AND col5_s6 not like '%str3%' -- 25%
> AND col5_s6 not like '%str4%' -- 25%
> {code}
> Total reduction is something like:
> {code}
> .25 * .25 * .15 ^ 2 * .25 ^ 4 = 0.000005
> {code}
> Filter estimation is a known hard problem. In general, one needs statistics
> and other data, and even then the estimates are just guesses.
> Still it is possible to ensure that the defaults are at least unbiased. That
> is if we assume that the probability of A LIKE B being 25%, then the
> probability of A NOT LIKE B should be 75%, not also 25%.
> This JIRA suggests creating an experimental set of defaults based on the
> "core" Calcite defaults, but with other reduction factors derived using the
> laws of probability. In particular:
> || Operator || Revised || Explanation || Calcite Default
> | = | 0.15 | Default in Calcite | 0.15
> | <> | 0.85 | 1 - p(=) | 0.5
> | < | 0.425 | p(<>) / 2 | 0.5
> | > | 0.425 | p(<>) / 2 | 0.5
> | <= | 0.575 | p(<) + p(=) | 0.5
> | >= | 0.575 | p(>) + p(=) | 0.5
> | LIKE | 0.25 | Default in Calcite | 0.25
> | NOT LIKE | 0.75 | 1 - p(LIKE) | 0.25
> | NOT NULL | 0.90 | Default in Calcite | 0.90
> | IS NULL | 0.10 | 1 - p(NOT NULL) | 0.25
> | IS TRUE | 0.5 | 1 / 2 | 0.25
> | IS FALSE | 0.5 | 1 / 2 | 0.25
> | IS NOT TRUE | 0.55 | 1 - p(IS TRUE) - p(IS NULL) | .25
> | IS NOT FALSE | 0.55 | 1 - p(IS FALSE) - p(IS NULL) | .25
> | A OR B | Varies | min(p(A) + p(B) - p(A ^ B), 0.5) | 0.5
> | IN (a) | 0.15 | p(=) | 0.5
> | x IN (a, b, c, ...) | Varies | p(x = a v x = b v x = c v ...) | 0.5
> The Calcite defaults should be taken as approximate.
> The probability of the IS NOT TRUE statement assumes the presence of nulls,
> while IS TRUE does not. The rule for OR caps the reduction factor at 0.5 per
> standard practice.
> With the revised rules, the example WHERE reduction becomes:
> {code}
> col1_s20 in ('Value1','Value2','Value3','Value4',
> 'Value5','Value6','Value7','Value8','Value9') -- 50%
> AND col2_i <=3 -- 57%
> AND col3_s1 = 'Y' -- 15%
> AND col4_s1 = 'Y' -- 15%
> AND col5_s6 not like '%str1%' -- 85%
> AND col5_s6 not like '%str2%' -- 85%
> AND col5_s6 not like '%str3%' -- 85%
> AND col5_s6 not like '%str4%' -- 85%
> .5 * .57 * .15^2 * .85^4 = 0.003
> {code}
> The new rules are not a panacea: they are still just guesses. However, they
> are unbiased guesses based on the rules of probability which result in more
> conservative reductions of filters. The result may be better plans in queries
> with large conjunctions (large number of expressions AND'ed together.)
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)