HEPBO3AH opened a new issue, #9612:
URL: https://github.com/apache/hudi/issues/9612
Hi, we are using Hudi on AWS. We have noticed the following unexpected
behavior.
A `SELECT * FROM table` creates a significant number of S3 calls:
```
+---------------------------------------------------------------------------------------------------------------+----------+---+
|path                                                                                                           |httpMethod|cnt|
+---------------------------------------------------------------------------------------------------------------+----------+---+
|my_table/.hoodie                                                                                               |HEAD      |5  |
|my_table/.hoodie/                                                                                              |HEAD      |5  |
|my_table/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile |HEAD      |5  |
|my_table/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/|HEAD      |5  |
|my_table/.hoodie/20221124035739002.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20221127222955674.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20221128000946056.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230203015652867.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230203034909027.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230323023115954.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230323024631265.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230323041457900.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230627223911673.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230706040420663.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230821012127985.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230821013120957.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/20230823042339397.replacecommit                                                               |GET       |5  |
|my_table/.hoodie/hoodie.properties                                                                             |GET       |5  |
|my_table/site_id%253D21/42d99963-db7f-400f-9e33-d539c74672aa-0_0-79-6549_20230323023115954.parquet             |GET       |3  |
|my_table/site_id%253D22/431ca0d1-8af3-4a72-bd17-31f2cd7e97e9-0_0-39-5633_20230323023017903.parquet             |GET       |3  |
|my_table/site_id%253D23/36675fed-d8ab-4532-aedc-dddf0a32accb-0_1-80-6551_20230323024631265.parquet             |GET       |3  |
|my_table/site_id%253D24/20efa6e4-489c-4b1a-a474-5dc1731485ed-0_0-80-6551_20230323041457900.parquet             |GET       |3  |
|my_table/site_id%253D30/15bfe605-bced-4d17-b571-62ebe64c5e97-0_0-80-6552_20230823042339397.parquet             |GET       |3  |
|my_table/site_id%253D30/27d907a2-7485-450d-9cdd-9f9c7e95fe88-0_0-39-5633_20230823044858848.parquet             |GET       |3  |
|my_table/site_id%253D21/.hoodie_partition_metadata                                                             |HEAD      |1  |
|my_table/site_id%253D21/.hoodie_partition_metadata                                                             |GET       |1  |
|my_table/site_id%253D21/42d99963-db7f-400f-9e33-d539c74672aa-0_0-79-6549_20230323023115954.parquet             |HEAD      |1  |
|my_table/site_id%253D22/.hoodie_partition_metadata                                                             |HEAD      |1  |
|my_table/site_id%253D22/.hoodie_partition_metadata                                                             |GET       |1  |
|my_table/site_id%253D22/431ca0d1-8af3-4a72-bd17-31f2cd7e97e9-0_0-39-5633_20230323023017903.parquet             |HEAD      |1  |
|my_table/site_id%253D23/.hoodie_partition_metadata                                                             |HEAD      |1  |
|my_table/site_id%253D23/.hoodie_partition_metadata                                                             |GET       |1  |
|my_table/site_id%253D23/36675fed-d8ab-4532-aedc-dddf0a32accb-0_1-80-6551_20230323024631265.parquet             |HEAD      |1  |
|my_table/site_id%253D24/.hoodie_partition_metadata                                                             |HEAD      |1  |
|my_table/site_id%253D24/.hoodie_partition_metadata                                                             |GET       |1  |
|my_table/site_id%253D24/20efa6e4-489c-4b1a-a474-5dc1731485ed-0_0-80-6551_20230323041457900.parquet             |HEAD      |1  |
|my_table/site_id%253D30/.hoodie_partition_metadata                                                             |HEAD      |1  |
|my_table/site_id%253D30/.hoodie_partition_metadata                                                             |GET       |1  |
|my_table/site_id%253D30/15bfe605-bced-4d17-b571-62ebe64c5e97-0_0-80-6552_20230823042339397.parquet             |HEAD      |1  |
|my_table/site_id%253D30/27d907a2-7485-450d-9cdd-9f9c7e95fe88-0_0-39-5633_20230823044858848.parquet             |HEAD      |1  |
+---------------------------------------------------------------------------------------------------------------+----------+---+
```
Why are there so many `HEAD` calls?
Why are there multiple `GET` calls per object?
I'm creating this ticket because we see a significant number of S3 calls
across our Hudi tables, which seems disproportionate to the number of queries
we run. These calls are starting to have non-negligible cost implications and
have even caused S3 throttling, which impacted Hudi job runs.
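
For anyone who wants to check their own tables, a per-path tally like the one above can be approximated with a simple group-by over request records. This is only a minimal sketch: it assumes the requests are already available as `(path, http_method)` pairs (for example, parsed from S3 access logs), and the function name `tally_requests` and the sample records are hypothetical, not taken from our setup.

```python
from collections import Counter


def tally_requests(records):
    """Count requests per (path, http_method) pair, most frequent first.

    `records` is any iterable of (path, http_method) tuples; how they are
    obtained (access logs, client metrics, ...) is left as an assumption.
    """
    counts = Counter((path, method) for path, method in records)
    # Sort by descending count, then by key, so the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample records, for illustration only.
sample = [
    ("my_table/.hoodie/hoodie.properties", "GET"),
    ("my_table/.hoodie/hoodie.properties", "GET"),
    ("my_table/.hoodie", "HEAD"),
]

for (path, method), cnt in tally_requests(sample):
    print(f"{path}\t{method}\t{cnt}")
```

Grouping on the pair rather than on the path alone keeps `HEAD` and `GET` requests against the same object on separate rows, which is what makes the repeated calls visible.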