[ https://issues.apache.org/jira/browse/KYLIN-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dayue Gao updated KYLIN-2438:
-----------------------------
    Description: 
In order to guard against bad queries that can consume lots of memory and 
potentially crash the Kylin or HBase server, Kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise, as the 
per-region maximum.
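
To make the current rule concrete, here is a minimal sketch of the selection logic described above. It is an illustration only, not Kylin's actual code: the function name, parameter names, and numeric values are all assumptions chosen for the example.

```python
def max_scan_rows(scan_threshold: int,
                  mem_budget_bytes: int,
                  estimated_row_size: int,
                  has_memory_hungry_metrics: bool) -> int:
    """Pick the row limit the way the description above states it.

    Hypothetical helper: names and signature are not taken from Kylin.
    """
    if not has_memory_hungry_metrics:
        # Rule 1: plain queries use the fixed row threshold.
        return scan_threshold
    # Rule 2: memory-hungry queries get a per-region row limit derived
    # from the memory budget and an *estimated* row size.
    return mem_budget_bytes // max(estimated_row_size, 1)


# Example values (assumed, for illustration only).
budget = 3 * 1024**3                      # 3 GiB memory budget
limit = max_scan_rows(10_000_000, budget, 1024, True)

# If real rows turn out to be 8 KiB instead of the estimated 1 KiB,
# a scan that stays within the row limit can still touch 8x the budget.
worst_case_bytes = limit * 8 * 1024
```

Because the row limit is only as good as the row-size estimate, an estimate that is off by a factor of 8 lets the same row limit admit 8 times the intended memory.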

This approach, however, has several deficiencies:
* It doesn't work well with complex, variable-length (varlen) metrics. The 
estimated threshold could be either too small or too large. If it's too small, 
good queries are killed. If it's too large, bad queries are not blocked.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* *kylin.query.scan.threshold* can't be overridden at the cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both the region and 
query level
* A new configuration, *kylin.query.max-scan-bytes*, will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which sets the limit at the 
region level. No need to rely on estimations about row size any more.
* The above two configs can be overridden at the cube level
* The old *kylin.query.scan.threshold* will be deprecated
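
The proposal above could be sketched roughly as follows. This is a hedged illustration, not the planned implementation: the two config keys come from this issue, but the numeric values, the `resolve_limit` helper, the `ScanBudget` class, and the exception name are all hypothetical.

```python
class ScanLimitExceeded(Exception):
    """Raised when a scan goes over its byte budget (hypothetical name)."""


# Global defaults. The keys are from this issue; the values are assumed.
GLOBAL_CONFIG = {
    "kylin.query.max-scan-bytes": 1 << 30,                         # query level
    "kylin.storage.hbase.coprocessor-max-scan-bytes": 256 << 20,   # region level
}


def resolve_limit(key: str, cube_overrides: dict) -> int:
    """A cube-level override wins over the global value, if present."""
    return cube_overrides.get(key, GLOBAL_CONFIG[key])


class ScanBudget:
    """Accumulates scanned bytes and fails once the limit is exceeded."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.scanned = 0

    def add(self, num_bytes: int) -> None:
        self.scanned += num_bytes
        if self.scanned > self.limit_bytes:
            raise ScanLimitExceeded(
                f"scanned {self.scanned} bytes, limit is {self.limit_bytes}"
            )


# Usage: a cube overrides the query-level limit; the region-level limit
# falls back to the global value.
cube_overrides = {"kylin.query.max-scan-bytes": 100}
query_budget = ScanBudget(resolve_limit("kylin.query.max-scan-bytes", cube_overrides))
query_budget.add(60)   # bytes reported by one region; still within the limit
```

Counting bytes directly removes the row-size estimate from the check, so the limit means the same thing for fixed-length and varlen metrics.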

  was:
In order to guard against bad queries that can consume lots of memory and 
potentially crash the Kylin or HBase server, Kylin limits the maximum number of 
rows a query can scan. The maximum value is chosen based on two configs:
# *kylin.query.scan.threshold* is used if the query doesn't contain 
memory-hungry metrics
# *kylin.query.mem.budget* / estimated_row_size is used otherwise, as the 
per-region maximum.

This approach, however, has several deficiencies:
* It doesn't work well with complex, variable-length (varlen) metrics. The 
estimated threshold could be either too small or too large. If it's too small, 
good queries are killed. If it's too large, bad queries are not blocked.
* Row count doesn't correspond to memory consumption, so it's difficult to 
determine how large the scan threshold should be.
* *kylin.query.scan.threshold* can't be overridden at the cube level.

In this JIRA, I propose to replace the current row-count-based threshold with a 
more intuitive size-based threshold:
* KYLIN-2437 will collect the number of bytes scanned at both the region and 
query level
* A new configuration, *kylin.query.max-scan-bytes*, will be added to limit the 
maximum number of bytes a query can scan
* *kylin.query.mem.budget* will be renamed to 
*kylin.storage.hbase.coprocessor-max-scan-bytes*, which sets the limit at the 
region level. We don't need to rely on estimations about row size any more.
* The above two configs can be overridden at the cube level
* The old *kylin.query.scan.threshold* will be deprecated


> replace scan threshold with max scan bytes
> ------------------------------------------
>
>                 Key: KYLIN-2438
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2438
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Query Engine, Storage - HBase
>    Affects Versions: v1.6.0
>            Reporter: Dayue Gao
>            Assignee: Dayue Gao



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
