ChristinaTech opened a new issue, #7151:
URL: https://github.com/apache/iceberg/issues/7151

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We recently encountered an issue whereby 
[GlueTableOperations](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/glue/GlueTableOperations.java),
 while performing an Iceberg commit on behalf of 
[GlueCatalog](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java),
 can incorrectly interpret a successful commit as a failure and delete the 
now-current table metadata file as part of cleanup. This leaves the Iceberg 
table inaccessible, because the "current metadata pointer" now points to a 
deleted metadata file. An engineer restored access by manually calling Glue 
APIs to reset the pointer to the previous metadata file, but this represents 
an availability risk to our data lake service.
   
   This appears to be a direct result of the AWS client's default of 3 
attempts per API call: Iceberg only looks at the exception thrown by the 
final attempt, as shown here:
   ```
   org.apache.iceberg.exceptions.CommitFailedException: Cannot commit catalog_name.database_name.table_name because Glue detected concurrent update
   Caused by: software.amazon.awssdk.services.glue.model.ConcurrentModificationException: Update table failed due to concurrent modifications. (Service: Glue, Status Code: 400, Request ID: <removed>)
   Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
   Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Service returned error code ServiceUnavailableException (Service: Glue, Status Code: 500, Request ID: <removed>)
   ```
   
   We quickly determined that no other writers were running against this 
table during the incident, so the ConcurrentModificationException must have 
been caused by one of the client's own earlier attempts, which updated the 
catalog despite reporting a failure. Had Iceberg received the standard timeout 
exception, its exception handling would have correctly called 
checkCommitStatus and determined that the commit had actually succeeded. 
However, because it only saw the ConcurrentModificationException from the 
final attempt, it treated the commit as failed and performed cleanup that it 
should not have. Notably, it would also have exhibited this incorrect behavior 
if the ServiceUnavailableException had come from the last attempt.
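   
   A minimal sketch of a more conservative classification, assuming earlier 
attempts always surface as suppressed exceptions the way they do in the trace 
above. `CommitOutcomeSketch`, `Outcome`, and `ConcurrentUpdate` are 
hypothetical stand-ins, not Iceberg or AWS SDK types. The idea is that a 
conflict counts as a definite failure only when no earlier attempt could have 
been applied:
   ```java
   // Sketch only: every type here is a hypothetical stand-in, not an Iceberg or AWS SDK class.
   final class CommitOutcomeSketch {

       enum Outcome { FAILED, UNKNOWN }

       /** Stand-in for the Glue ConcurrentModificationException seen in the trace. */
       static final class ConcurrentUpdate extends RuntimeException {
           ConcurrentUpdate(String message) {
               super(message);
           }
       }

       /**
        * Classifies the exception surfaced once the client has exhausted its attempts.
        * A conflict is treated as a definite failure only when no earlier attempt is
        * recorded as suppressed; otherwise an earlier attempt may have been applied.
        */
       static Outcome classify(RuntimeException surfaced) {
           boolean hadEarlierAttempts = surfaced.getSuppressed().length > 0;
           if (surfaced instanceof ConcurrentUpdate && !hadEarlierAttempts) {
               return Outcome.FAILED;
           }
           // Timeouts, 5xx errors, and conflicts after a retry all leave the outcome unknown.
           return Outcome.UNKNOWN;
       }
   }
   ```
   Anything classified as `UNKNOWN` would then go through the same status 
check as a timeout, rather than straight to cleanup.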
   
   As expected, Iceberg then attempted to refresh its metadata and retry the 
commit. Unfortunately, it had just deleted the object that the metadata 
pointer referenced, resulting in:
   ```
   org.apache.spark.SparkException: Writing job aborted
   Caused by: org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://fake-bucket-name/database_name.db/table_name/metadata/06814-ec5ff66c-af38-492c-ba38-55610536d9a7.metadata.json
   Caused by: software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist. (Service: S3, Status Code: 404, Request ID: <removed>, Extended Request ID: <removed>)
   ```
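   
   One way to guard the cleanup step, sketched under the assumption that the 
catalog pointer can be re-read before any delete. `CleanupGuardSketch` and 
`safeToDelete` are hypothetical names, and the single equality check is a 
simplification of whatever verification a real implementation would need:
   ```java
   import java.util.function.Supplier;

   // Sketch only: hypothetical helper, not part of Iceberg.
   final class CleanupGuardSketch {

       /**
        * Returns true only if the catalog pointer, re-read after the failed call,
        * does not reference the metadata file this writer just produced.
        */
       static boolean safeToDelete(String newMetadataLocation, Supplier<String> readCatalogPointer) {
           String current = readCatalogPointer.get();
           return !newMetadataLocation.equals(current);
       }
   }
   ```
   With a guard like this, the metadata file in the error above would have 
been kept, because the pointer already referenced it when cleanup ran.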
   
   While investigating, I noticed the same sequence of events would also cause 
[DynamoDbTableOperations](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/dynamodb/DynamoDbTableOperations.java),
 which uses an AWS client configured in the same way, to take the same 
incorrect action and leave the table inaccessible.
   
   Note: I removed some solution-specific information from the error logs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

