[GitHub] [iceberg] can-sun opened a new issue, #6125: Encountered throttling when writing to S3 without repartitioning

GitBox Fri, 04 Nov 2022 14:25:03 -0700


can-sun opened a new issue, #6125:
URL: https://github.com/apache/iceberg/issues/6125


   ### Apache Iceberg version
   
   0.14.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I am batch-writing data to my S3 bucket with the code snippet below, and the
   job fails with an S3 throttling error:
   
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 590 
in stage 7.0 failed 4 times, most recent failure: Lost task 590.3 in stage 7.0 
(TID 1750) (172.34.25.153 executor 108): 
software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your 
request rate. (Service: S3, Status Code: 503, Request ID: 0MYS30NPVXFFRM9R, 
Extended Request ID:  ****)
   ```
   
   Code snippet:
   
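   ```scala
   import org.apache.spark.sql.functions.col

   // Sort rows by event time within each Spark partition, then append to the table.
   dataFrame
           .sortWithinPartitions(col(eventTimeFeatureName))
           .writeTo(f"$dataCatalogName.$dataBaseName.`$tableName`")
           .option("compression", "none")
           .append()
   ```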
   
   `$dataCatalogName.$dataBaseName.$tableName` is the Iceberg table I created
   in Glue; it is partitioned by a truncation of the `eventTimeFeatureName`
   column. We also followed the practice described
   [here](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout)
   and set `'write.object-storage.enabled'=true`. We verified that the Parquet
   files are written to S3 locations with random prefixes, but the S3
   throttling failures persist.
   
   The data file we used for the test is about 8 GB, and the
   `eventTimeFeatureName` values span one year. To reduce the number of files
   written, I repartitioned the input DataFrame as shown below, and the write
   succeeds. However, I believe this will greatly impact performance.
   
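   ```scala
   import org.apache.spark.sql.functions.{col, trunc}

   // Repartition by a temporary truncated event-time column so that rows sharing
   // a truncated value land in the same Spark partition, then drop the column.
   tempDataFrame
           .withColumn("trunc_event_time", trunc(col(eventTimeFeatureName), "yyyy-MM-dd"))
           .repartition(col("trunc_event_time"))
           .drop(col("trunc_event_time"))
           .sortWithinPartitions(col(eventTimeFeatureName))
           .writeTo(f"$dataCatalogName.$dataBaseName.`$tableName`")
           .option("compression", "none")
           .append()
   ```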
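   
   To illustrate why the repartitioned write issues far fewer S3 requests, here
   is a rough back-of-the-envelope sketch in plain Scala (no Spark needed). It
   assumes that every write task emits at least one data file per table
   partition value it holds rows for, and it ignores file rollover at the
   target file size. The task count and the number of partition values are
   hypothetical; the actual truncation width and stage size may differ.
   
   ```scala
   // Rough upper bound on the number of data files produced by one append.
   // Assumption: each write task emits at least one file per table partition
   // value it holds rows for (file rollover at the target size is ignored).
   object FileCountEstimate {
     // Without repartitioning, any task may hold rows for any partition value.
     def withoutRepartition(tasks: Int, partitionValues: Int): Long =
       tasks.toLong * partitionValues

     // After repartitioning by the partition key, each value lands in one task.
     def withRepartition(partitionValues: Int): Long =
       partitionValues.toLong

     def main(args: Array[String]): Unit = {
       val tasks = 600            // hypothetical; failed task index 590 implies at least 591 tasks
       val partitionValues = 365  // hypothetical; one partition value per day over one year
       println(s"without repartition: up to ${withoutRepartition(tasks, partitionValues)} files")
       println(s"with repartition:    about ${withRepartition(partitionValues)} files")
     }
   }
   ```
   
   Under these assumed numbers, the unpartitioned write could create up to
   600 * 365 = 219000 small files, while the repartitioned write creates
   roughly 365, which would explain the difference in request rate.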
   
   Does the Iceberg team have any suggestions or best practices we can follow?

