Vsevolod Ostapenko created KYLIN-3122:
-----------------------------------------
Summary: Partition elimination algorithm seems to be inefficient
and have serious issues with handling date/time ranges, can lead to very slow
queries and OOM/Java heap dump conditions
Key: KYLIN-3122
URL: https://issues.apache.org/jira/browse/KYLIN-3122
Project: Kylin
Issue Type: Bug
Components: Storage - HBase
Affects Versions: v2.2.0
Environment: HDP 2.5.6, Kylin 2.2.0
Reporter: Vsevolod Ostapenko
Assignee: hongbin ma
Current algorithm of cube segment elimination seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
  "partition_desc": {
    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
    "partition_date_start": 0,
    "partition_date_format": "yyyyMMdd",
    "partition_time_format": "HH",
    "partition_type": "APPEND",
    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },
Cubes contain partitions for multiple days and 24 hours for each day. Each cube
segment corresponds to just one hour.
When a query specifies both date and hour using equality conditions (e.g.
thedate = '20171011' and thehour = '00'), Kylin sequentially iterates over all
the cube segments (hundreds of them) only to skip all except the one that
needs to be scanned (this can be observed in the logs).
The expectation is that Kylin would use the existing info on the partitioning
columns (date and time) and the known hierarchical relation between date and
time to locate the required partition much more efficiently than a linear scan
through all the cube partitions.
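
To illustrate the expectation above, here is a minimal sketch (not Kylin code)
of locating segments by binary search instead of a linear scan. It assumes
segments are sorted, non-overlapping, half-open [start, end) time ranges; the
SegmentIndex class and its method names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


class SegmentIndex:
    """Hypothetical index over sorted, non-overlapping [start, end) segments."""

    def __init__(self, segments):
        # Sort once and keep the boundaries in separate lists for bisect.
        self.segments = sorted(segments)
        self.starts = [start for start, _ in self.segments]
        self.ends = [end for _, end in self.segments]

    def overlapping(self, query_start, query_end):
        """Return segments that overlap the half-open range [query_start, query_end)."""
        # First segment whose end is strictly after the query start.
        lo = bisect_right(self.ends, query_start)
        # First segment whose start is at or after the query end.
        hi = bisect_left(self.starts, query_end)
        return self.segments[lo:hi]


# 30 days of hourly segments, similar in shape to the cube described above.
first = datetime(2017, 10, 1)
segments = [
    (first + timedelta(hours=i), first + timedelta(hours=i + 1))
    for i in range(24 * 30)
]
index = SegmentIndex(segments)

# thedate = '20171011' and thehour = '00'
hits = index.overlapping(datetime(2017, 10, 11, 0), datetime(2017, 10, 11, 1))
```

With the boundaries indexed once, each lookup costs two binary searches rather
than a pass over every segment.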
Now, if the filtering condition is on a range of hours, the behavior of
partition pruning and scanning becomes illogical, which suggests bugs in the
logic.
If the condition is on a specific date and a closed-open range of hours (e.g.
thedate = '20171011' and thehour >= '10' and thehour < '11'), then in addition
to sequentially iterating over all the cube partitions (as described above),
Kylin will scan HBase regions for all the hours from the starting hour until
the last hour of the day (e.g. from hour 10 to 24).
As a result, the query will run much longer than necessary and might run out
of memory.
If the condition is on a specific date but the hour interval is specified as
open-closed (e.g. thedate = '20171011' and thehour > '10' and thehour <= '11'),
Kylin will scan all HBase regions for all the later dates and hours (e.g. from
hour 10 until the most recent hour on the most recent day).
As a result, query execution time will increase dramatically, and in most
cases the Kylin server will be terminated with an OOM error and a JVM heap
dump.
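
For reference, a minimal sketch (not Kylin code) of the time range each of the
two hour predicates above would be expected to cover. It assumes hour values
are whole hours in "HH" format and dates are in "yyyyMMdd" format, as in the
partition_desc; the hour_range function and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def hour_range(date_str, low, low_inclusive, high, high_inclusive):
    """Map a date plus hour bounds to a half-open [start, end) time range.

    Hours are discrete, so "thehour > h" is treated as "thehour >= h + 1"
    and "thehour <= h" as "thehour < h + 1".
    """
    day = datetime.strptime(date_str, "%Y%m%d")
    first_hour = int(low) if low_inclusive else int(low) + 1
    end_hour = int(high) + 1 if high_inclusive else int(high)
    return day + timedelta(hours=first_hour), day + timedelta(hours=end_hour)


# thedate = '20171011' and thehour >= '10' and thehour < '11'
closed_open = hour_range("20171011", "10", True, "11", False)

# thedate = '20171011' and thehour > '10' and thehour <= '11'
open_closed = hour_range("20171011", "10", False, "11", True)
```

Under these assumptions both predicates resolve to a single hour (10:00 to
11:00 and 11:00 to 12:00 respectively), so each would need only one hourly
segment to be scanned.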