[ https://issues.apache.org/jira/browse/DRILL-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5254:
-------------------------------
    Description: 
Drill uses Calcite for query parsing and optimization, and it relies on 
Calcite's default selectivity (reduction factor) rules to estimate the number 
of rows removed by a filter.

The default rules appear to be overly aggressive in estimating reductions. In a 
production use case, an input with 4 billion rows was estimated to return just 
40K rows from a filter. That is, the filter was estimated to reduce the row 
count by a factor of roughly 1/100,000. As it turns out, the actual reduction 
was closer to 1/2.

The result was that the planner compared the expected 40K rows against another 
input of 2.5 million rows, and decided the 40K rows would be best on the build 
side of a hash join. When confronted with the actual 3 billion rows, the hash 
join ran out of memory.

The moral of the story is that, in Drill, it is worth being conservative when 
planning for memory-intensive operations.

The (sanitized) filter is the following, annotated with (a guess at) the 
default reduction factors in each term:

{code}
col1_s20 in ('Value1','Value2','Value3','Value4',
                     'Value5','Value6','Value7','Value8','Value9') -- 25%
AND col2_i <=3 -- 50%
AND col3_s1 = 'Y' -- 15%
AND col4_s1 = 'Y' -- 15%
AND col5_s6 not like '%str1%' -- 25%
AND col5_s6 not like '%str2%' -- 25%
AND col5_s6 not like '%str3%' -- 25%
AND col5_s6 not like '%str4%' -- 25%
{code}

Total reduction is something like:

{code}
.25 * .5 * .15 ^ 2 * .25 ^ 4 = 0.00001
{code}
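
To see how quickly these per-term guesses compound, here is a minimal, 
self-contained Java sketch (illustrative only, not Drill or Calcite code) that 
combines the factors the way the planner combines AND'ed predicates:

{code}
// Illustrative only: per-term reduction factors compound under AND.
public class ReductionDemo {
  public static void main(String[] args) {
    // Guessed default factors for the sanitized filter above.
    double[] factors = {0.25, 0.5, 0.15, 0.15, 0.25, 0.25, 0.25, 0.25};
    double total = 1.0;
    for (double f : factors) {
      total *= f;  // AND combines factors by multiplication
    }
    // Prints ~0.0000110: about 44K rows out of 4 billion, i.e. the
    // planner's 40K estimate.
    System.out.printf("total reduction = %.7f%n", total);
  }
}
{code}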

Filter estimation is a known hard problem. In general, one needs statistics and 
other data, and even then the estimates are just guesses.

Still, it is possible to ensure that the defaults are at least unbiased. That 
is, if we assume the probability of A LIKE B is 25%, then the probability of A 
NOT LIKE B should be 75%, not also 25%.

This JIRA suggests creating an experimental set of defaults based on the "core" 
Calcite defaults, but with other reduction factors derived using the laws of 
probability. In particular:

|| Operator || Revised || Explanation || Calcite Default ||
| = | 0.15 | Default in Calcite | 0.15
| <> | 0.85 | 1 - p(=) | 0.5
| < | 0.425 | p(<>) / 2 | 0.5
| > | 0.425 | p(<>) / 2 | 0.5
| <= | 0.575 | p(<) + p(=) | 0.5
| >= | 0.575 | p(>) + p(=) | 0.5
| LIKE | 0.25 | Default in Calcite | 0.25
| NOT LIKE | 0.75 | 1 - p(LIKE) | 0.25
| NOT NULL | 0.90 | Default in Calcite | 0.90
| IS NULL | 0.10 | 1 - p(NOT NULL) | 0.25
| IS TRUE | 0.5 | 1 / 2 | 0.25
| IS FALSE | 0.5 | 1 / 2 | 0.25
| IS NOT TRUE | 0.55 | p(IS NULL) + (1 - p(IS NULL)) / 2 | 0.25
| IS NOT FALSE | 0.55 | p(IS NULL) + (1 - p(IS NULL)) / 2 | 0.25
| A OR B | Varies | min(p(A) + p(B), 0.5) | 0.5
| A AND B | Varies | p(A ^ B) = p(A) * p(B) | Same
| IN (a) | 0.15 | p(=) | 0.25
| x IN (a, b, c, ...) | Varies | p(x = a v x = b v x = c v ...) | 0.25
| NOT A | Varies | 1 - p(A) | 0.25
| BETWEEN a AND b | 0.33 | p(<= ^ >=) | 0.25
| NOT BETWEEN a AND b | 0.67 | 1 - p(BETWEEN) | 0.25
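
For concreteness, the derivations above can be written out as a minimal, 
self-contained Java sketch. This is illustrative only: a real implementation 
would plug into Calcite's selectivity metadata, and the class, method, and 
constant names here are invented for the example:

{code}
// Illustrative sketch of the revised default reduction factors.
public class RevisedDefaults {
  static final double EQ = 0.15;        // Calcite default for =
  static final double LIKE = 0.25;      // Calcite default for LIKE
  static final double NOT_NULL = 0.90;  // Calcite default for IS NOT NULL

  static double guess(String op) {
    double le = (1 - EQ) / 2 + EQ;      // p(<=) = p(>=) = 0.575
    switch (op) {
      case "=":             return EQ;
      case "<>":            return 1 - EQ;         // 0.85
      case "<": case ">":   return (1 - EQ) / 2;   // 0.425
      case "<=": case ">=": return le;             // 0.575
      case "LIKE":          return LIKE;
      case "NOT LIKE":      return 1 - LIKE;       // 0.75
      case "IS NOT NULL":   return NOT_NULL;
      case "IS NULL":       return 1 - NOT_NULL;   // 0.10
      case "IS TRUE": case "IS FALSE":
        return 0.5;                                // nulls assumed out of play
      case "IS NOT TRUE": case "IS NOT FALSE":
        return (1 - NOT_NULL) + NOT_NULL / 2;      // 0.10 + 0.45 = 0.55
      case "BETWEEN":       return le * le;        // ~0.33
      case "NOT BETWEEN":   return 1 - le * le;    // ~0.67
      default:              return 0.25;           // generic fallback guess
    }
  }
}
{code}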

The Calcite defaults were identified by inspection and verified by tests. The 
Calcite rules make sense if one considers conditional probability: the user 
applied a particular expression to the data with the expectation that, given 
that data set, the expression matches about 25% of the rows.

The probability for IS NOT TRUE assumes the presence of nulls, while that for 
IS TRUE does not. This is another example of conditional probability. If the 
user asks whether something is TRUE, we can assume they are not worried about 
nulls. If they ask whether it is NOT FALSE, the only justification for such 
syntax is to include nulls. Since we assume nulls (when present) make up 10% 
of the data, half of the remaining 90% is true and half is false, giving 0.10 
+ 0.45 = 0.55.

Note that, in the proposed implementation, the operand of an IS TRUE or 
similar clause is treated as an opaque constant. That is, the following 
deliver different reduction factors:

{code}
a = 10
(a = 10) IS TRUE
{code}

One could argue that the second should give the selectivity for "a = 10", but 
the code (at present) treats this the same as "b IS TRUE" where b just happens 
to be "a = 10".

The rule for OR caps the reduction factor at 0.5, per standard practice. We 
assume the disjuncts are mutually exclusive, so the combined reduction is 
simply the sum of the individual factors. (If the data consists of equal parts 
"a", "b", "c", and "d", then p("a") = 25% and p("a" v "b" v "c" v "d") = 
100%.)

The rule for BETWEEN arises because Calcite (or Drill?) rewrites {{a BETWEEN b 
AND c}} as {{a >= b AND a <= c}}, giving 0.575 * 0.575 = 0.33.

With the revised rules, the example WHERE reduction becomes:

{code}
col1_s20 in ('Value1','Value2','Value3','Value4',
                     'Value5','Value6','Value7','Value8','Value9') -- 50%
AND col2_i <=3 -- 57%
AND col3_s1 = 'Y' -- 15%
AND col4_s1 = 'Y' -- 15%
AND col5_s6 not like '%str1%' -- 75%
AND col5_s6 not like '%str2%' -- 75%
AND col5_s6 not like '%str3%' -- 75%
AND col5_s6 not like '%str4%' -- 75%

.5 * .57 * .15^2 * .75^4 = 0.002
{code}

The new rules are not a panacea: they are still just guesses. However, they 
are unbiased guesses based on the rules of probability, and they produce more 
conservative reduction factors. The result may be better plans for queries 
with large conjunctions (a large number of expressions AND'ed together).

> Enhance default reduction factors in optimizer
> ----------------------------------------------
>
>                 Key: DRILL-5254
>                 URL: https://issues.apache.org/jira/browse/DRILL-5254
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
