[
https://issues.apache.org/jira/browse/HIVE-27119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772444#comment-17772444
]
okumin commented on HIVE-27119:
-------------------------------
[~rajesh.balamohan] I think Auto Reduce Parallelism would help you.
{code}
zookage@client-node-0:~$ beeline -e 'delete from store_Sales where
ss_customer_sk % 10 = 0' --hiveconf hive.tez.auto.reducer.parallelism=true
--hiveconf hive.server2.in.place.progress=false
...
INFO : 2023-10-06 05:58:56,843 Map 1: -/- Reducer 2: 0/8
INFO : 2023-10-06 05:58:58,367 Map 1: 0/1 Reducer 2: 0/8
INFO : 2023-10-06 05:59:01,396 Map 1: 0(+1)/1 Reducer 2: 0/8
INFO : 2023-10-06 05:59:12,491 Map 1: 1/1 Reducer 2: 0/8
INFO : 2023-10-06 05:59:13,005 Map 1: 1/1 Reducer 2: 2/2
{code}
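The other workaround mentioned in the description, explicitly controlling the number of reducers, can also be applied per session. A sketch using standard Hive settings (the values 8 and 1 GiB are illustrative, not tuned):
{code:sql}
-- Cap how many reducers (and therefore how many delete files) the DELETE can produce.
-- hive.exec.reducers.max and hive.exec.reducers.bytes.per.reducer are standard
-- Hive settings; the specific values here are examples only.
SET hive.exec.reducers.max=8;
SET hive.exec.reducers.bytes.per.reducer=1073741824;
DELETE FROM store_sales WHERE ss_customer_sk % 10 = 0;
{code}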
> Iceberg: Delete from table generates lot of files
> -------------------------------------------------
>
> Key: HIVE-27119
> URL: https://issues.apache.org/jira/browse/HIVE-27119
> Project: Hive
> Issue Type: Improvement
> Components: Iceberg integration
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance
>
> A "delete" generates a lot of files because of the way data is distributed to
> the reducers: the number of files per partition is driven by the number of
> reduce tasks. One workaround is to explicitly control the number of reducers;
> creating this ticket to track a long-term fix.
>
> {noformat}
> explain delete from store_Sales where ss_customer_sk % 10 = 0;
> INFO : Compiling
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b):
> explain delete from store_Sales where ss_customer_sk % 10 = 0
> INFO : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns:
> ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt,
> ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number,
> ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk,
> ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price,
> ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk,
> ss_net_paid_inc_tax
> INFO : Semantic Analysis Completed (retrial = false)
> INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain,
> type:string, comment:null)], properties:null)
> INFO : Completed compiling
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b);
> Time taken: 0.704 seconds
> INFO : Executing
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b):
> explain delete from store_Sales where ss_customer_sk % 10 = 0
> INFO : Starting task [Stage-4:EXPLAIN] in serial mode
> INFO : Completed executing
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b);
> Time taken: 0.005 seconds
> INFO : OK
> Explain
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-2 depends on stages: Stage-1
> Stage-0 depends on stages: Stage-2
> Stage-3 depends on stages: Stage-0
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean)
> Statistics: Num rows: 2755519629 Data size: 3643899155232
> Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: ((ss_customer_sk % 10) = 0) (type: boolean)
> Statistics: Num rows: 1377759814 Data size: 1821949576954
> Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: PARTITION__SPEC__ID (type: int),
> PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION
> (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int),
> ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int),
> ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int),
> ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost
> (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price
> (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)),
> ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type:
> decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type:
> decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type:
> decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type:
> decimal(7,2)), ss_sold_date_sk (type: int)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4,
> _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14,
> _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23,
> _col24, _col25, _col26
> Statistics: Num rows: 1377759814 Data size:
> 1821949576954 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type:
> bigint), _col2 (type: string), _col3 (type: bigint)
> null sort order: zzzz
> sort order: ++++
> Statistics: Num rows: 1377759814 Data size:
> 1821949576954 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col4 (type: int), _col5 (type:
> int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type:
> int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13
> (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16
> (type: decimal(7,2)), _col17 (type: decimal(7,2)), _col18 (type:
> decimal(7,2)), _col19 (type: decimal(7,2)), _col20 (type: decimal(7,2)),
> _col21 (type: decimal(7,2)), _col22 (type: decimal(7,2)), _col23 (type:
> decimal(7,2)), _col24 (type: decimal(7,2)), _col25 (type: decimal(7,2)),
> _col26 (type: int)
> Execution mode: llap
> LLAP IO: no inputs
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Select Operator
> expressions: KEY.reducesinkkey0 (type: int),
> KEY.reducesinkkey1 (type: bigint), KEY.reducesinkkey2 (type: string),
> KEY.reducesinkkey3 (type: bigint), VALUE._col0 (type: int), VALUE._col1
> (type: int), VALUE._col2 (type: int), VALUE._col3 (type: int), VALUE._col4
> (type: int), VALUE._col5 (type: int), VALUE._col6 (type: int), VALUE._col7
> (type: int), VALUE._col8 (type: bigint), VALUE._col9 (type: int),
> VALUE._col10 (type: decimal(7,2)), VALUE._col11 (type: decimal(7,2)),
> VALUE._col12 (type: decimal(7,2)), VALUE._col13 (type: decimal(7,2)),
> VALUE._col14 (type: decimal(7,2)), VALUE._col15 (type: decimal(7,2)),
> VALUE._col16 (type: decimal(7,2)), VALUE._col17 (type: decimal(7,2)),
> VALUE._col18 (type: decimal(7,2)), VALUE._col19 (type: decimal(7,2)),
> VALUE._col20 (type: decimal(7,2)), VALUE._col21 (type: decimal(7,2)),
> VALUE._col22 (type: int)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
> _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15,
> _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24,
> _col25, _col26
> Statistics: Num rows: 1377759814 Data size: 1821949576954
> Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1377759814 Data size: 1821949576954
> Basic stats: COMPLETE Column stats: NONE
> table:
> input format:
> org.apache.iceberg.mr.hive.HiveIcebergInputFormat
> output format:
> org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
> serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
> name: tpcds_1000_iceberg_mor_v4.store_sales
> Stage: Stage-2
> Dependency Collection
> Stage: Stage-0
> Move Operator
> tables:
> replace: false
> table:
> input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
> output format:
> org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
> serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
> name: tpcds_1000_iceberg_mor_v4.store_sales
> Stage: Stage-3
> Stats Work
> Basic Stats Work:
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)