ChristinaTech opened a new issue, #7151: URL: https://github.com/apache/iceberg/issues/7151
### Apache Iceberg version

1.1.0 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

We recently encountered an issue whereby [GlueTableOperations](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/glue/GlueTableOperations.java), while performing an Iceberg commit on behalf of [GlueCatalog](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java), can incorrectly interpret a successful commit as a failure, and delete the now-current table metadata file as part of cleanup. This leaves the Iceberg table inaccessible as the "current metadata pointer" now points to a deleted metadata file. We were able to correct this via an engineer manually calling Glue APIs to correct the pointer to the previous metadata file, but this represents an availability risk to our data lake service.

The reason this happens seems to be a direct result of the AWS client's default 3 attempts for a given API call, whereby Iceberg only looks at the exception thrown by the final attempt, as shown here:

```
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit catalog_name.database_name.table_name because Glue detected concurrent update
Caused by: software.amazon.awssdk.services.glue.model.ConcurrentModificationException: Update table failed due to concurrent modifications. (Service: Glue, Status Code: 400, Request ID: <removed>)
    Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
    Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Service returned error code ServiceUnavailableException (Service: Glue, Status Code: 500, Request ID: <removed>)
```

We were very quickly able to determine no other writers were running on this table during the incident, which means the ConcurrentModificationException had to be from one of its own prior attempts updating the catalog despite returning a failure. If it had received the standard timeout exception, the exception logic would have correctly called checkCommitStatus and determined the commit was actually successful. However, as it only saw the ConcurrentModificationException from the final attempt, it treated the commit as failed and performed cleanup it should not have done. Notably, it would have also exhibited this incorrect behavior if the ServiceUnavailableException had been the last attempt.

As expected, Iceberg attempts to refresh its metadata and retry the commit. Unfortunately, it just deleted the object the metadata pointer directs to, resulting in:

```
org.apache.spark.SparkException: Writing job aborted
Caused by: org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://fake-bucket-name/database_name.db/table_name/metadata/06814-ec5ff66c-af38-492c-ba38-55610536d9a7.metadata.json
Caused by: software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist. (Service: S3, Status Code: 404, Request ID: <removed>, Extended Request ID: <removed>)
```

While investigating, I noticed the same sequence of events would also cause [DynamoDbTableOperations](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/dynamodb/DynamoDbTableOperations.java), which uses an AWS client configured in the same way, to take the same incorrect action, with the same outcome of the table becoming inaccessible.

Note: Removed some solution-specific information from error logs.
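
To illustrate the idea that the final attempt's exception alone is not enough, here is a minimal sketch of a check that also looks at the suppressed exceptions from earlier attempts, as they appear in the first stack trace. It is not the Iceberg implementation or a proposed patch. `RetryAwareCommitCheck`, `Outcome`, and `classify` are hypothetical names, and the rule it applies (any earlier attempt recorded as suppressed makes the commit state unknown) is a deliberately conservative assumption. It uses only `java.lang.Throwable`, so plain exceptions stand in for the AWS SDK types.

```java
// Hypothetical helper, not part of Iceberg or the AWS SDK.
final class RetryAwareCommitCheck {

  // FAILED: safe to clean up the new metadata file.
  // UNKNOWN: an earlier attempt may have succeeded server-side, so the
  // commit status should be verified before anything is deleted.
  enum Outcome { FAILED, UNKNOWN }

  private RetryAwareCommitCheck() {}

  // Walks the cause chain of the failure and reports UNKNOWN if any
  // exception in the chain carries suppressed exceptions, which is how
  // the earlier request attempts show up in the stack trace above.
  // Assumption: every earlier attempt is treated as ambiguous,
  // regardless of its type or status code.
  static Outcome classify(Throwable failure) {
    for (Throwable t = failure; t != null; t = t.getCause()) {
      if (t.getSuppressed().length > 0) {
        return Outcome.UNKNOWN;
      }
    }
    return Outcome.FAILED;
  }
}
```

With a check like this, a ConcurrentModificationException that arrives with a suppressed "Read timed out" attempt would be routed to the same status verification as a plain timeout, while a ConcurrentModificationException from a single attempt would still be treated as a real conflict.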
