can-sun opened a new issue, #6125:
URL: https://github.com/apache/iceberg/issues/6125
### Apache Iceberg version
0.14.1
### Query engine
Spark
### Please describe the bug 🐞
I am using the code snippet below to batch write data to my S3 bucket
and ran into S3 throttling:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 590
in stage 7.0 failed 4 times, most recent failure: Lost task 590.3 in stage 7.0
(TID 1750) (172.34.25.153 executor 108):
software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your
request rate. (Service: S3, Status Code: 503, Request ID: 0MYS30NPVXFFRM9R,
Extended Request ID: ****)
```
Code snippet:
```scala
dataFrame
  .sortWithinPartitions(col(eventTimeFeatureName))
  .writeTo(f"$dataCatalogName.$dataBaseName.`$tableName`")
  .option("compression", "none")
  .append()
```
`$dataCatalogName.$dataBaseName.$tableName` is the Iceberg table I created
in Glue, and the table is partitioned by truncating the column
`eventTimeFeatureName`. We also followed the practice described
[here](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout)
and set `'write.object-storage.enabled'=true`. We verified that the Parquet
files are written to S3 locations with random prefixes. However, the S3
throttling failure still occurs persistently.
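For reference, a minimal sketch of one way to set and inspect that table property from Spark. It assumes the `SparkSession` is already configured with the Iceberg catalog; the catalog, database, and table values are placeholders, not our real names.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the session is already configured with an Iceberg catalog (e.g. Glue).
val spark = SparkSession.builder().getOrCreate()

// Placeholder identifiers for illustration only.
val dataCatalogName = "my_catalog"
val dataBaseName = "my_db"
val tableName = "my_table"

// Enable the object-storage file layout on the table.
spark.sql(
  s"""ALTER TABLE $dataCatalogName.$dataBaseName.`$tableName`
     |SET TBLPROPERTIES ('write.object-storage.enabled' = 'true')""".stripMargin)

// List the table properties to confirm the setting was applied.
spark.sql(s"SHOW TBLPROPERTIES $dataCatalogName.$dataBaseName.`$tableName`").show(false)
```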
The data file we used for testing is about 8 GB, and `eventTimeFeatureName`
spans one year. To reduce the number of files written, I repartitioned the
input DataFrame as shown below. This works, but I believe it will
significantly hurt write performance.
```scala
tempDataFrame
  .withColumn("trunc_event_time", trunc(col(eventTimeFeatureName), "yyyy-MM-dd"))
  .repartition(col("trunc_event_time"))
  .drop(col("trunc_event_time"))
  .sortWithinPartitions(col(eventTimeFeatureName))
  .writeTo(f"$dataCatalogName.$dataBaseName.`$tableName`")
  .option("compression", "none")
  .append()
```
Does the Iceberg team have any suggestions or best practices we can follow?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]