[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

Krisztian Kasa (Jira) Tue, 27 Feb 2024 01:15:41 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Krisztian Kasa updated HIVE-28080:
----------------------------------
    Fix Version/s: 4.1.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

Merged to master. Thanks [~okumin] for the patch.

> Propagate statistics from a source table to the materialized CTE
> ----------------------------------------------------------------
>
>                 Key: HIVE-28080
>                 URL: https://issues.apache.org/jira/browse/HIVE-28080
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>    Affects Versions: 2.3.8, 3.1.3, 4.0.0-beta-1
>            Reporter: okumin
>            Assignee: okumin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0
>
>
> Hive doesn't fill in the statistics of materialized CTEs, and the size of 
> those TableScans is underestimated. That causes Tez to run with fewer tasks 
> or to fail with OOM because MapJoin could be wrongly applied.
>  
> The following example shows Map 1 reading `src` generates 493 rows, but Map 3 
> reading `cte` is expected to scan only 1 row.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN WITH cte AS (
> . . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
> . . . . . . . . . . . . . . . . . . . . . . .> )
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
> . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
> ...
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in Stage-4                       |
> | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-3                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-4                                        |
> |       Map 2 vectorized                             |
> |       File Output Operator [FS_69]                 |
> |         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
> |           
> Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
>  |
> |         <-Map 3 [BROADCAST_EDGE] vectorized        |
> |           SHUFFLE [RS_61]                          |
> |             PartitionCols:_col0                    |
> |             Select Operator [SEL_60] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_59] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_11] (rows=1 width=368) |
> |                   
> default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
> |             
> Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
>  |
> |           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
> |             BROADCAST [RS_64]                      |
> |               PartitionCols:_col0                  |
> |               Select Operator [SEL_63]             |
> |                 Output:["_col0","_col1"]           |
> |           <-Select Operator [SEL_66] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_65] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_5] (rows=1 width=368) |
> |                   
> default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-2                                    |
> |           Dependency Collection{}                  |
> |             Stage-1                                |
> |               Map 1 vectorized                     |
> |               File Output Operator [FS_4]          |
> |                 table:{"name:":"default.cte"}      |
> |                 Select Operator [SEL_3] (rows=493 width=350) |
> |                   Output:["_col0","_col1"]         |
> |                   TableScan [TS_0] (rows=493 width=350) |
> |                     
> default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-0                                    |
> |           Move Operator                            |
> |              Please refer to the previous Stage-1  |
> |                                                    |
> +----------------------------------------------------+ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

Reply via email to