[
https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krisztian Kasa updated HIVE-28080:
----------------------------------
Fix Version/s: 4.1.0
Resolution: Fixed
Status: Resolved (was: Patch Available)
Merged to master. Thanks [~okumin] for the patch.
> Propagate statistics from a source table to the materialized CTE
> ----------------------------------------------------------------
>
> Key: HIVE-28080
> URL: https://issues.apache.org/jira/browse/HIVE-28080
> Project: Hive
> Issue Type: Improvement
> Components: Query Planning
> Affects Versions: 2.3.8, 3.1.3, 4.0.0-beta-1
> Reporter: okumin
> Assignee: okumin
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Hive doesn't fill in the statistics of materialized CTEs, and the size of
> those TableScans is underestimated. That causes Tez to run with fewer tasks
> or to fail with OOM because MapJoin could be wrongly applied.
>
> The following example shows Map 1 reading `src` generates 493 rows, but Map 3
> reading `cte` is expected to scan only 1 row.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN WITH cte AS (
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM src
> . . . . . . . . . . . . . . . . . . . . . . .> )
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
> . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
> ...
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in Stage-4 |
> | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (SIMPLE_EDGE) |
> | |
> | Stage-3 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-4 |
> | Map 2 vectorized |
> | File Output Operator [FS_69] |
> | Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
> |
> Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
> |
> | <-Map 3 [BROADCAST_EDGE] vectorized |
> | SHUFFLE [RS_61] |
> | PartitionCols:_col0 |
> | Select Operator [SEL_60] (rows=1 width=368) |
> | Output:["_col0","_col1"] |
> | Filter Operator [FIL_59] (rows=1 width=368) |
> | predicate:key is not null |
> | TableScan [TS_11] (rows=1 width=368) |
> |
> default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> | <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
> |
> Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
> |
> | <-Reducer 4 [BROADCAST_EDGE] vectorized |
> | BROADCAST [RS_64] |
> | PartitionCols:_col0 |
> | Select Operator [SEL_63] |
> | Output:["_col0","_col1"] |
> | <-Select Operator [SEL_66] (rows=1 width=368) |
> | Output:["_col0","_col1"] |
> | Filter Operator [FIL_65] (rows=1 width=368) |
> | predicate:key is not null |
> | TableScan [TS_5] (rows=1 width=368) |
> |
> default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> | Stage-2 |
> | Dependency Collection{} |
> | Stage-1 |
> | Map 1 vectorized |
> | File Output Operator [FS_4] |
> | table:{"name:":"default.cte"} |
> | Select Operator [SEL_3] (rows=493 width=350) |
> | Output:["_col0","_col1"] |
> | TableScan [TS_0] (rows=493 width=350) |
> |
> default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> | Stage-0 |
> | Move Operator |
> | Please refer to the previous Stage-1 |
> | |
> +----------------------------------------------------+ {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)