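[jira] [Commented] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions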

Shaofeng SHI (JIRA) Sun, 28 Jan 2018 03:41:24 -0800

    [ 
https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342541#comment-16342541
 ]


Shaofeng SHI commented on KYLIN-3122:
-------------------------------------

I think the dual date/time partition column feature was not fully implemented. 
It only handles the case of fetching data from the source (Hive) and does not 
consider segment pruning at query time.

[~mahongbin] [[email protected]] I know you're investigating multiple 
partition columns. Can this issue be fixed?

Vsevolod, as a temporary workaround, I suggest using a single column as the 
partition column. You can define a new column in a view that combines the 
"THEDATE" and "THEHOUR" columns into one, and then build the cube and query 
with it.
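
To illustrate the shape of such a combined column, here is a minimal sketch in 
Python. It assumes the "yyyyMMdd" and "HH" formats from the model below; the 
function name combine_date_hour is hypothetical, and in practice the 
concatenation would be done in the Hive view definition rather than in Python.

```python
from datetime import datetime


def combine_date_hour(thedate: str, thehour: str) -> str:
    """Join a yyyyMMdd date and an HH hour into one sortable yyyyMMddHH key.

    Hypothetical helper for illustration only; not part of Kylin.
    """
    # Parsing validates the input and raises ValueError if it is malformed.
    parsed = datetime.strptime(thedate + thehour, "%Y%m%d%H")
    return parsed.strftime("%Y%m%d%H")


key = combine_date_hour("20171011", "10")
print(key)  # 2017101110
```

Because the combined value sorts in chronological order, a range filter on it 
maps to a single contiguous range of segments.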

If the problem cannot be fixed, I would suggest removing the dual date/time 
partition columns here.

 

 

> Partition elimination algorithm seems to be inefficient and have serious 
> issues with handling date/time ranges, can lead to very slow queries and 
> OOM/Java heap dump conditions
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3122
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3122
>             Project: Kylin
>          Issue Type: Bug
>          Components: Query Engine
>    Affects Versions: v2.2.0
>         Environment: HDP 2.5.6, Kylin 2.2.0
>            Reporter: Vsevolod Ostapenko
>            Assignee: hongbin ma
>            Priority: Critical
>
> The current cube segment elimination algorithm seems to be rather inefficient.
>  We are using a model where cubes are partitioned by date and time:
>  "partition_desc": {
>    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>    "partition_date_start": 0,
>    "partition_date_format": "yyyyMMdd",
>    "partition_time_format": "HH",
>    "partition_type": "APPEND",
>    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
>  },
> Cubes contain partitions for multiple days and 24 hours for each day. Each 
> cube segment corresponds to just one hour.
> When a query is issued where both date and hour are specified using equality 
> conditions (e.g. thedate = '20171011' and thehour = '10'), Kylin sequentially 
> iterates over all the cube segments (hundreds of them) only to skip all 
> except for the one that needs to be scanned (which can be observed by looking 
> in the logs).
>  The expectation is that Kylin would use existing info on the partitioning 
> columns (date and time) and the known hierarchical relation between date and 
> time to locate the required partition much more efficiently than a linear 
> scan through all the cube partitions.
> Now, if the filtering condition is on a range of hours, the behavior of 
> partition pruning and scanning becomes illogical, which suggests bugs in the 
> logic.
> If the filtering condition is on a specific date and a closed-open range of 
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'), in 
> addition to sequentially scanning all the cube partitions (as described 
> above), Kylin will scan HBase tables for all the hours from the specified 
> starting hour until the last hour of the day (e.g. from hour 10 to 24, 
> instead of just hour 10).
>  As a result, the query will run much longer than necessary and might run 
> out of memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is on a specific date but the hour interval is 
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and 
> thehour <= '10'), Kylin will scan all HBase tables for all the later dates 
> and hours (e.g. from hour 10 until the most recent hour on the most recent 
> day, which can be hundreds of tables and thousands of regions).
>  As a result, query execution time will increase dramatically, and in most 
> cases the Kylin server will be terminated with an OOM error and a JVM heap 
> dump.
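
The expected behavior described in the report can be sketched roughly as 
follows. This is not Kylin's actual pruning code, only an illustration in 
Python under stated assumptions: each hourly segment is identified by a 
sortable yyyyMMddHH key, the keys are kept in a sorted list, and hours are 
handled as integers. The names hour_range and prune_segments are hypothetical.

```python
from bisect import bisect_left, bisect_right


def hour_range(lo, lo_inclusive, hi, hi_inclusive):
    """Normalize bounds on an integer hour to an inclusive (first, last) pair."""
    first = lo if lo_inclusive else lo + 1
    last = hi if hi_inclusive else hi - 1
    return first, last


def prune_segments(segment_keys, thedate, first_hour, last_hour):
    """Return the hourly segment keys within [first_hour, last_hour] on thedate.

    segment_keys must be a sorted list of yyyyMMddHH strings (an assumption of
    this sketch). Binary search locates the matching slice without visiting
    every segment.
    """
    low_key = f"{thedate}{first_hour:02d}"
    high_key = f"{thedate}{last_hour:02d}"
    start = bisect_left(segment_keys, low_key)
    end = bisect_right(segment_keys, high_key)
    return segment_keys[start:end]  # empty when the range matches nothing


# Hypothetical cube: three days with 24 hourly segments each.
segments = [
    f"{day}{hour:02d}"
    for day in ("20171010", "20171011", "20171012")
    for hour in range(24)
]

# thedate = '20171011' and thehour >= '10' and thehour < '11'
closed_open = prune_segments(segments, "20171011", *hour_range(10, True, 11, False))

# thedate = '20171011' and thehour > '09' and thehour <= '10'
open_closed = prune_segments(segments, "20171011", *hour_range(9, False, 10, True))

print(closed_open)  # ['2017101110']
print(open_closed)  # ['2017101110']
```

Both range forms normalize to the same single-hour interval, so under these 
assumptions only the segment for hour 10 would be selected in either case.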



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

