[
https://issues.apache.org/jira/browse/KYLIN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shaofeng SHI updated KYLIN-2165:
--------------------------------
Description:
Kylin will count on the intermediate flat hive table to get the total row
number, then to redistribute that.
>From hive's wiki, hive will automatically collect the table statistics when
>run a "insert overwrite" statement, then the subsequent "select count(*)" will
>be very fast. While, Kylin is executing "INSERT OVERWRITE DIRECTORY
>'/kylin/row_count' SELECT count(*) from", which still cause MR/Tez job be
>started, this will cause the step take longer time.
Just change the SQL to "select count(*)" or using Hive API to get the
statistic, the cost will be saved.
Here is a sample, the table
'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an intermediate
table :
If directly run "count(*)", it is pretty fast:
hive> select count(*) from
kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
OK
970033
Time taken: 0.112 seconds, Fetched: 1 row(s)
While today Kylin's SQL will cause a job be started:
hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from
kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1463701915919_46208, Tracking URL =
was:
Kylin will count on the intermediate flat hive table to get the total row
number, then to redistribute that.
>From hive's wiki, hive will automatically collect the table statistics when
>run a "insert overwrite" statement, then the subsequent "select count(*)" will
>be very fast. While, Kylin is executing "INSERT OVERWRITE DIRECTORY
>'/kylin/row_count' SELECT count(*) from", which still cause MR/Tez job be
>started, this will cause the step take longer time.
Just change the SQL to "select count(*)" or using Hive API to get the
statistic, the cost will be saved.
> Use hive table statistics data to get the total count
> -----------------------------------------------------
>
> Key: KYLIN-2165
> URL: https://issues.apache.org/jira/browse/KYLIN-2165
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Shaofeng SHI
> Assignee: Shaofeng SHI
> Fix For: v1.6.0
>
>
> Kylin will count on the intermediate flat hive table to get the total row
> number, then to redistribute that.
> From hive's wiki, hive will automatically collect the table statistics when
> run a "insert overwrite" statement, then the subsequent "select count(*)"
> will be very fast. While, Kylin is executing "INSERT OVERWRITE DIRECTORY
> '/kylin/row_count' SELECT count(*) from", which still cause MR/Tez job be
> started, this will cause the step take longer time.
> Just change the SQL to "select count(*)" or using Hive API to get the
> statistic, the cost will be saved.
> Here is a sample, the table
> 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an
> intermediate table :
> If directly run "count(*)", it is pretty fast:
> hive> select count(*) from
> kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> OK
> 970033
> Time taken: 0.112 seconds, Fetched: 1 row(s)
> While today Kylin's SQL will cause a job be started:
> hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from
> kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
> set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
> set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
> set mapreduce.job.reduces=<number>
> Starting Job = job_1463701915919_46208, Tracking URL =
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)