Rajesh Balamohan created HIVE-24296:
---------------------------------------
             Summary: NDV adjusted twice causing reducer task underestimation
                 Key: HIVE-24296
                 URL: https://issues.apache.org/jira/browse/HIVE-24296
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2550]

{{StatsRulesProcFactory::updateColStats}}:

{code:java}
if (ratio <= 1.0) {
  newDV = (long) Math.ceil(ratio * oldDV);
}
cs.setCountDistint(newDV);
{code}

Though RelHive* already carries the latest statistics, the NDV is adjusted again in {{StatsRulesProcFactory::updateColStats}}, and the adjustment is done at linear scale. Because of this, the downstream vertex gets a smaller number of tasks, causing latency issues.

E.g. TPC-DS Q10 at 10 TB scale. Attaching a snippet of "explain analyze" which shows the stats underestimation. "Reducer 13" is underestimated ~10x when compared to the runtime details. The projected NDV from RelHive* was around 65989699. However, due to the ratio calculation in StatsRulesProcFactory, it gets readjusted to (948122598 / 14291978461) * 65989699 ~= 4377723 (a small sketch of this arithmetic is included after the plan snippet below). It would be good to remove the static readjustment in StatsRulesProcFactory.

{noformat}
 Edges:
        Map 10 <- Map 9 (BROADCAST_EDGE)
        Map 12 <- Map 9 (BROADCAST_EDGE)
        Map 2 <- Map 7 (BROADCAST_EDGE)
        Map 8 <- Map 9 (BROADCAST_EDGE), Reducer 6 (BROADCAST_EDGE)
        Reducer 11 <- Map 10 (SIMPLE_EDGE)
        Reducer 13 <- Map 12 (SIMPLE_EDGE)
        Reducer 3 <- Map 1 (BROADCAST_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE), Map 8 (CUSTOM_SIMPLE_EDGE), Reducer 11 (BROADCAST_EDGE), Reducer 13 (BROADCAST_EDGE)
        Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
        Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
        Reducer 6 <- Map 2 (CUSTOM_SIMPLE_EDGE)

        Map 12
            Map Operator Tree:
                TableScan
                  alias: catalog_sales
                  filterExpr: cs_ship_customer_sk is not null (type: boolean)
                  Statistics: Num rows: 14327953968/552509183 Data size: 228959459440 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: cs_ship_customer_sk is not null (type: boolean)
                    Statistics: Num rows: 14291978461/551122492 Data size: 228384573968 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: cs_ship_customer_sk (type: bigint), cs_sold_date_sk (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 14291978461/551122492 Data size: 228384573968 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col1 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 9
                        Statistics: Num rows: 948122598/551122492 Data size: 7297899376 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: bigint)
                          minReductionHashAggr: 0.99
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 126954025/61576194 Data size: 977191880 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            key expressions: _col0 (type: bigint)
                            null sort order: a
                            sort order: +
                            Map-reduce partition columns: _col0 (type: bigint)
                            Statistics: Num rows: 126954025/61576194 Data size: 977191880 Basic stats: COMPLETE Column stats: COMPLETE
...
...
        Reducer 13
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: bigint)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 4377725/40166690 Data size: 33696280 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: true (type: boolean), _col0 (type: bigint)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 4377725/40166690 Data size: 51207180 Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: _col1 (type: bigint)
                    null sort order: a
                    sort order: +
                    Map-reduce partition columns: _col1 (type: bigint)
                    Statistics: Num rows: 4377725/40166690 Data size: 51207180 Basic stats: COMPLETE Column stats: COMPLETE
                    value expressions: _col0 (type: boolean)
{noformat}
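To make the readjustment concrete, here is a minimal standalone sketch (not the actual Hive code; class and variable names are illustrative) that applies the same linear-scale formula as the {{updateColStats}} snippet above to the row counts and NDV quoted from this plan:

{code:java}
// Illustrative only: reproduces the linear-scale NDV readjustment described
// above, using the Map 12 row counts and the CBO-projected NDV from this plan.
public class NdvReadjustmentSketch {
    public static void main(String[] args) {
        long parentNumRows = 14291978461L; // estimated rows feeding the Map Join in Map 12
        long joinNumRows   = 948122598L;   // estimated rows after the Map Join
        long oldDV         = 65989699L;    // NDV already projected by RelHive* (CBO)

        // Same shape as the quoted snippet: if (ratio <= 1.0) newDV = ceil(ratio * oldDV)
        double ratio = (double) joinNumRows / parentNumRows;
        long newDV = oldDV;
        if (ratio <= 1.0) {
            newDV = (long) Math.ceil(ratio * oldDV);
        }

        // Prints roughly 4.38M, matching the 4377725 estimate shown for Reducer 13,
        // i.e. ~10x below the ~40166690 rows observed at runtime.
        System.out.println("ratio = " + ratio + ", readjusted NDV = " + newDV);
    }
}
{code}

Since the CBO-projected NDV already reflects the join selectivity, scaling it a second time by the row-count ratio is what produces the ~10x underestimation and the small reducer parallelism for Reducer 13.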