[
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342541#comment-16342541
]
Shaofeng SHI commented on KYLIN-3122:
-------------------------------------
I think the dual date/time partition columns feature was not fully implemented.
It only covers fetching data from the source (Hive) and is not taken into
account for segment pruning at query time.
[~mahongbin] [[email protected]] I know you're investigating multiple
partition columns; can this issue be fixed?
Vsevolod, as a temporary workaround, I suggest using a single column as the
partition column. You can define a new column in a view that combines the
"THEDATE" and "THEHOUR" columns into one, and then build the cube and query
with it.
If the problem cannot be fixed, I would suggest removing support for dual
date/time partition columns.
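
To illustrate the single-column workaround, here is a minimal Python sketch of
how a combined key could be derived. The function name combine_date_hour and
the yyyyMMddHH layout are assumptions made for illustration; they are not part
of Kylin or of the reporter's model. The point is that a fixed-width combined
key sorts as text in chronological order, so one partition column is enough to
express any date/hour range.

```python
from datetime import datetime


def combine_date_hour(thedate: str, thehour: str) -> str:
    """Build a fixed-width yyyyMMddHH key from a yyyyMMdd date and an HH hour.

    Hypothetical helper: it mirrors what a view column combining THEDATE and
    THEHOUR could compute.
    """
    day = datetime.strptime(thedate, "%Y%m%d")  # raises ValueError if malformed
    hour = int(thehour)
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {thehour!r}")
    # Zero-pad the hour so that string order matches chronological order.
    return day.strftime("%Y%m%d") + f"{hour:02d}"


rows = [("20171011", "09"), ("20171010", "23"), ("20171011", "10")]
keys = sorted(combine_date_hour(d, h) for d, h in rows)
```

With such a column, a filter like thedate = '20171011' and thehour = '10'
becomes a single equality on the combined key.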
> Partition elimination algorithm seems to be inefficient and have serious
> issues with handling date/time ranges, can lead to very slow queries and
> OOM/Java heap dump conditions
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-3122
> URL: https://issues.apache.org/jira/browse/KYLIN-3122
> Project: Kylin
> Issue Type: Bug
> Components: Query Engine
> Affects Versions: v2.2.0
> Environment: HDP 2.5.6, Kylin 2.2.0
> Reporter: Vsevolod Ostapenko
> Assignee: hongbin ma
> Priority: Critical
>
> The current cube segment elimination algorithm seems to be rather
> inefficient.
> We are using a model where cubes are partitioned by date and time:
> "partition_desc": {
>   "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>   "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>   "partition_date_start": 0,
>   "partition_date_format": "yyyyMMdd",
>   "partition_time_format": "HH",
>   "partition_type": "APPEND",
>   "partition_condition_builder":
>     "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
> },
> Cubes contain partitions for multiple days and 24 hours for each day. Each
> cube segment corresponds to just one hour.
> When a query is issued where both date and hour are specified using equality
> conditions (e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially
> iterates over all the cube segments (hundreds of them) only to skip all
> except for the one that needs to be scanned (which can be observed by looking
> in the logs).
> The expectation is that Kylin would use the existing information on the
> partitioning columns (date and time) and the known hierarchical relation
> between date and time to locate the required partition much more efficiently
> than a linear scan through all the cube partitions.
> If the filtering condition is on a range of hours, the behavior of partition
> pruning and scanning becomes illogical, which suggests bugs in the logic.
> If the filtering condition is on a specific date and a closed-open range of
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'),
> then in addition to sequentially scanning all the cube partitions (as
> described above), Kylin will scan HBase tables for all the hours from the
> specified starting hour until the last hour of the day (e.g. from hour 10 to
> 24, instead of just hour 10).
> As a result, the query will run much longer than necessary and might run out
> of memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is on a specific date but the hour interval is
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and
> thehour <= '10'), Kylin will scan all HBase tables for all the later dates
> and hours (e.g. from hour 10 until the most recent hour on the most recent
> day, which can be hundreds of tables and thousands of regions).
> As a result, query execution time will increase dramatically, and in most
> cases the Kylin server will be terminated with an OOM error and a JVM heap
> dump.
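
For reference, the pruning behavior the reporter expects can be sketched in
Python as follows. This is only an illustration under stated assumptions, not
Kylin's actual implementation: the helper names hour_interval and prune are
hypothetical, segments are assumed to be sorted, hour-aligned and exactly one
hour long, and hour bounds are assumed to stay within 0..23. Both range forms
from the description are normalized to the same half-open interval, and the
matching segments are located by binary search instead of a linear pass.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def hour_interval(thedate, lo, lo_inclusive, hi, hi_inclusive):
    """Translate a date plus hour bounds into a half-open [start, end) interval."""
    day = datetime.strptime(thedate, "%Y%m%d")
    first = int(lo) if lo_inclusive else int(lo) + 1  # first hour that matches
    last = int(hi) if hi_inclusive else int(hi) - 1   # last hour that matches
    return day + first * HOUR, day + (last + 1) * HOUR


def prune(segment_starts, start, end):
    """Return the start times of hourly segments that fall inside [start, end).

    segment_starts must be sorted. Because segments and the query interval are
    both hour-aligned, a segment overlaps the interval exactly when its start
    time lies in [start, end).
    """
    i = bisect_left(segment_starts, start)
    j = bisect_left(segment_starts, end)
    return segment_starts[i:j]


# Three days of hourly segments starting at 2017-10-10 00:00 (made-up data).
segments = [datetime(2017, 10, 10) + n * HOUR for n in range(72)]

# thedate = '20171011' and thehour >= '10' and thehour < '11'
closed_open = hour_interval("20171011", "10", True, "11", False)
# thedate = '20171011' and thehour > '09' and thehour <= '10'
open_closed = hour_interval("20171011", "09", False, "10", True)

hits_closed_open = prune(segments, *closed_open)
hits_open_closed = prune(segments, *open_closed)
```

Under these assumptions, both predicates select only the segment for hour 10
of 2017-10-11, and the lookup cost grows logarithmically with the number of
segments.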
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)