[ 
https://issues.apache.org/jira/browse/HIVE-27327?focusedWorklogId=861149&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861149
 ]

ASF GitHub Bot logged work on HIVE-27327:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/May/23 08:46
            Start Date: 09/May/23 08:46
    Worklog Time Spent: 10m 
      Work Description: zhangbutao commented on code in PR #4301:
URL: https://github.com/apache/hive/pull/4301#discussion_r1188325890


##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -346,7 +346,16 @@ public Map<String, String> getBasicStatistics(Partish partish) {
             stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
           }
           if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
-            stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+            long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+            if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
+                summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
+              Long actualRecords =
Review Comment:
   Just sharing some thoughts.
   If I understand correctly, a delete file in Iceberg is also a special kind of data file, and the table scan at actual execution time still has to read all the related delete files.
   
   That is to say, actual execution still requires scanning more data than the explain output shows.
   So I am not sure whether this PR can produce an optimized plan when an Iceberg table has both data files and delete files.
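
   For context, the adjustment under discussion can be sketched as a standalone snippet (a minimal illustration, not the actual patch; the string keys mirror Iceberg's `SnapshotSummary` constants, and `computeRowCount` is a hypothetical helper):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class RowCountSketch {

     // Keys as they appear in the Iceberg snapshot summary JSON.
     static final String TOTAL_RECORDS = "total-records";
     static final String TOTAL_POS_DELETES = "total-position-deletes";
     static final String TOTAL_EQ_DELETES = "total-equality-deletes";

     // Estimate the live row count by subtracting delete counts from
     // total-records. A position delete removes exactly one row; an equality
     // delete may match any number of rows, so with equality deletes present
     // this is only an estimate.
     static long computeRowCount(Map<String, String> summary) {
       long totalRecords = Long.parseLong(summary.get(TOTAL_RECORDS));
       long posDeletes = Long.parseLong(summary.getOrDefault(TOTAL_POS_DELETES, "0"));
       long eqDeletes = Long.parseLong(summary.getOrDefault(TOTAL_EQ_DELETES, "0"));
       return Math.max(0, totalRecords - posDeletes - eqDeletes);
     }

     public static void main(String[] args) {
       // Numbers from the snapshot summary in the issue description:
       // 300 total-records, 254 position deletes, 0 equality deletes.
       Map<String, String> summary = new HashMap<>();
       summary.put(TOTAL_RECORDS, "300");
       summary.put(TOTAL_POS_DELETES, "254");
       summary.put(TOTAL_EQ_DELETES, "0");
       System.out.println(computeRowCount(summary)); // prints 46
     }
   }
   ```

   With the issue's example numbers this yields 46, matching the `select count(*)` result, whereas the unadjusted total-records value of 300 is what currently feeds the TableScan estimate.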





Issue Time Tracking
-------------------

    Worklog Id:     (was: 861149)
    Time Spent: 50m  (was: 40m)

> Iceberg basic stats: Incorrect row count in snapshot summary leading to 
> unoptimized plans
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-27327
>                 URL: https://issues.apache.org/jira/browse/HIVE-27327
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Simhadri Govindappa
>            Assignee: Simhadri Govindappa
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> In the absence of equality deletes, the total row count should be:
> {noformat}
> row_count = total-records - total-position-deletes{noformat}
>  
>  
> Example:
> After many inserts and deletes, there are only 46 records in a table.
> {noformat}
> >>select count(*) from llap_orders;
> +------+
> | _c0  |
> +------+
> | 46   |
> +------+
> 1 row selected (7.22 seconds)
> {noformat}
>  
> But the total records in the snapshot summary indicate that there are 300 records:
>  
> {noformat}
>  {
>     "sequence-number" : 19,
>     "snapshot-id" : 4237525869561629328,
>     "parent-snapshot-id" : 2572487769557272977,
>     "timestamp-ms" : 1683553017982,
>     "summary" : {
>       "operation" : "append",
>       "added-data-files" : "5",
>       "added-records" : "12",
>       "added-files-size" : "3613",
>       "changed-partition-count" : "5",
>       "total-records" : "300",
>       "total-files-size" : "164405",
>       "total-data-files" : "100",
>       "total-delete-files" : "73",
>       "total-position-deletes" : "254",
>       "total-equality-deletes" : "0"
>     }{noformat}
>  
> As a result, the Hive plans generated are unoptimized:
> {noformat}
> 0: jdbc:hive2://simhadrigovindappa-2.simhadri> explain update llap_orders set 
> itemid=7 where itemid=5;
> INFO  : OK
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> | Reducer 3 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-4                                            |
> |   Stats Work{}                                     |
> |     Stage-0                                        |
> |       Move Operator                                |
> |         table:{"name:":"db.llap_orders"}           |
> |         Stage-3                                    |
> |           Dependency Collection{}                  |
> |             Stage-2                                |
> |               Reducer 2 vectorized                 |
> |               File Output Operator [FS_14]         |
> |                 table:{"name:":"db.llap_orders"}   |
> |                 Select Operator [SEL_13] (rows=150 width=424) |
> |                   
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9"]
>  |
> |                 <-Map 1 [SIMPLE_EDGE]              |
> |                   SHUFFLE [RS_4]                   |
> |                     Select Operator [SEL_3] (rows=150 width=424) |
> |                       
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col7","_col8","_col9"]
>  |
> |                       Select Operator [SEL_2] (rows=150 width=644) |
> |                         
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col7","_col8","_col9","_col10","_col11","_col13","_col14","_col15"]
>  |
> |                         Filter Operator [FIL_9] (rows=150 width=220) |
> |                           predicate:(itemid = 5)   |
> |                           TableScan [TS_0] (rows=300 width=220) |
> |                             
> db@llap_orders,llap_orders,Tbl:COMPLETE,Col:COMPLETE,Output:["orderid","quantity","itemid","tradets","p1","p2"]
>  |
> |               Reducer 3 vectorized                 |
> |               File Output Operator [FS_16]         |
> |                 table:{"name:":"db.llap_orders"}   |
> |                 Select Operator [SEL_15]           |
> |                   
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col4","_col5"] |
> |                 <-Map 1 [SIMPLE_EDGE]              |
> |                   SHUFFLE [RS_10]                  |
> |                     PartitionCols:_col4, _col5     |
> |                     Select Operator [SEL_7] (rows=150 width=220) |
> |                       
> Output:["_col0","_col1","_col2","_col3","_col4","_col5"] |
> |                        Please refer to the previous Select Operator [SEL_2] 
> |
> |                                                    |
> +----------------------------------------------------+
> 39 rows selected (0.104 seconds)
> 0: jdbc:hive2://simhadrigovindappa-2.simhadri>{noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
