[
https://issues.apache.org/jira/browse/HIVE-27119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772444#comment-17772444
]
okumin commented on HIVE-27119:
-------------------------------
[~rajesh.balamohan] I think Auto Reduce Parallelism would help you.
{code}
zookage@client-node-0:~$ beeline -e 'delete from store_Sales where
ss_customer_sk % 10 = 0' --hiveconf hive.tez.auto.reducer.parallelism=true
--hiveconf hive.server2.in.place.progress=false
...
INFO : 2023-10-06 05:58:56,843 Map 1: -/- Reducer 2: 0/8
INFO : 2023-10-06 05:58:58,367 Map 1: 0/1 Reducer 2: 0/8
INFO : 2023-10-06 05:59:01,396 Map 1: 0(+1)/1 Reducer 2: 0/8
INFO : 2023-10-06 05:59:12,491 Map 1: 1/1 Reducer 2: 0/8
INFO : 2023-10-06 05:59:13,005 Map 1: 1/1 Reducer 2: 2/2
{code}
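The other workaround mentioned in the description, explicitly controlling the number of reducers, can also be applied per session. A sketch using standard Hive settings (the values 8 and 1 GiB are illustrative, not tuned):
{code:sql}
-- Cap how many reducers (and therefore how many delete files) the DELETE can produce.
-- hive.exec.reducers.max and hive.exec.reducers.bytes.per.reducer are standard
-- Hive settings; the specific values here are examples only.
SET hive.exec.reducers.max=8;
SET hive.exec.reducers.bytes.per.reducer=1073741824;
DELETE FROM store_sales WHERE ss_customer_sk % 10 = 0;
{code}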
> Iceberg: Delete from table generates lot of files
> -------------------------------------------------
>
> Key: HIVE-27119
> URL: https://issues.apache.org/jira/browse/HIVE-27119
> Project: Hive
> Issue Type: Improvement
> Components: Iceberg integration
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance
>
> A "delete" generates a lot of files because of the way data is distributed to
> the reducers: the number of files per partition is driven by the number of
> reduce tasks. One workaround is to explicitly control the number of reducers;
> creating this ticket to track a long-term fix.
>
> {noformat}
> explain delete from store_Sales where ss_customer_sk % 10 = 0;
> INFO : Compiling
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b):
> explain delete from store_Sales where ss_customer_sk % 10 = 0
> INFO : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns:
> ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt,
> ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number,
> ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk,
> ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price,
> ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk,
> ss_net_paid_inc_tax
> INFO : Semantic Analysis Completed (retrial = false)
> INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain,
> type:string, comment:null)], properties:null)
> INFO : Completed compiling
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b);
> Time taken: 0.704 seconds
> INFO : Executing
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b):
> explain delete from store_Sales where ss_customer_sk % 10 = 0
> INFO : Starting task [Stage-4:EXPLAIN] in serial mode
> INFO : Completed executing
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b);
> Time taken: 0.005 seconds
> INFO : OK
> Explain
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-2 depends on stages: Stage-1
> Stage-0 depends on stages: Stage-2
> Stage-3 depends on stages: Stage-0
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean)
> Statistics: Num rows: 2755519629 Data size: 3643899155232
> Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: ((ss_customer_sk % 10) = 0) (type: boolean)
> Statistics: Num rows: 1377759814 Data size: 1821949576954
> Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: PARTITION__SPEC__ID (type: int),
> PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION
> (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int),
> ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int),
> ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int),
> ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost
> (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price
> (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)),
> ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type:
> decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type:
> decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type:
> decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type:
> decimal(7,2)), ss_sold_date_sk (type: int)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4,
> _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14,
> _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23,
> _col24, _col25, _col26
> Statistics: Num rows: 1377759814 Data size:
> 1821949576954 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: int), _col1 (type:
> bigint), _col2 (type: string), _col3 (type: bigint)
> null sort order: zzzz
> sort order: ++++
> Statistics: Num rows: 1377759814 Data size:
> 1821949576954 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col4 (type: int), _col5 (type:
> int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type:
> int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13
> (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16
> (type: decimal(7,2)), _col17 (type: decimal(7,2)), _col18 (type:
> decimal(7,2)), _col19 (type: decimal(7,2)), _col20 (type: decimal(7,2)),
> _col21 (type: decimal(7,2)), _col22 (type: decimal(7,2)), _col23 (type:
> decimal(7,2)), _col24 (type: decimal(7,2)), _col25 (type: decimal(7,2)),
> _col26 (type: int)
> Execution mode: llap
> LLAP IO: no inputs
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Select Operator
> expressions: KEY.reducesinkkey0 (type: int),
> KEY.reducesinkkey1 (type: bigint), KEY.reducesinkkey2 (type: string),
> KEY.reducesinkkey3 (type: bigint), VALUE._col0 (type: int), VALUE._col1
> (type: int), VALUE._col2 (type: int), VALUE._col3 (type: int), VALUE._col4
> (type: int), VALUE._col5 (type: int), VALUE._col6 (type: int), VALUE._col7
> (type: int), VALUE._col8 (type: bigint), VALUE._col9 (type: int),
> VALUE._col10 (type: decimal(7,2)), VALUE._col11 (type: decimal(7,2)),
> VALUE._col12 (type: decimal(7,2)), VALUE._col13 (type: decimal(7,2)),
> VALUE._col14 (type: decimal(7,2)), VALUE._col15 (type: decimal(7,2)),
> VALUE._col16 (type: decimal(7,2)), VALUE._col17 (type: decimal(7,2)),
> VALUE._col18 (type: decimal(7,2)), VALUE._col19 (type: decimal(7,2)),
> VALUE._col20 (type: decimal(7,2)), VALUE._col21 (type: decimal(7,2)),
> VALUE._col22 (type: int)
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
> _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15,
> _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24,
> _col25, _col26
> Statistics: Num rows: 1377759814 Data size: 1821949576954
> Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1377759814 Data size: 1821949576954
> Basic stats: COMPLETE Column stats: NONE
> table:
> input format:
> org.apache.iceberg.mr.hive.HiveIcebergInputFormat
> output format:
> org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
> serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
> name: tpcds_1000_iceberg_mor_v4.store_sales
> Stage: Stage-2
> Dependency Collection
> Stage: Stage-0
> Move Operator
> tables:
> replace: false
> table:
> input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
> output format:
> org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
> serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe
> name: tpcds_1000_iceberg_mor_v4.store_sales
> Stage: Stage-3
> Stats Work
> Basic Stats Work:
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)