mwa28 commented on issue #2166:
URL:
https://github.com/apache/iceberg-python/issues/2166#issuecomment-3026948275
Interesting, here is AWS support's take on the matter:
Using my support tools, I observed that you executed a query running "SELECT
* FROM "iceberg_database"."icebergCorrupted" limit 10", followed by query
executing "OPTIMIZE iceberg_database.icebergCorrupted REWRITE DATA USING
BIN_PACK WHERE col_9 = '2025-06-30'". After the optimization operation, Athena
detected the table as corrupted.
To investigate this issue further, I replicated your table structure using
the manifest file you provided. I created a test table using the following
PyIceberg script:
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema, NestedField
from pyiceberg.types import StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform
from pyiceberg.table.sorting import SortOrder, SortField
glue_catalog = load_catalog(
'iceberg_database',
**{
'client.access-key-id': '[REDACTED]',
'client.secret-access-key': '[REDACTED]',
'client.region': 'ca-central-1'
},
type='glue'
)
# Define your schema with all required fields
schema = Schema(
NestedField(1, "col_1", StringType(), required=True),
NestedField(2, "col_2", StringType(), required=True),
NestedField(3, "col_3", StringType(), required=True),
NestedField(4, "col_4", StringType(), required=True),
NestedField(5, "col_5", StringType(), required=True),
NestedField(6, "col_6", StringType(), required=True),
NestedField(7, "col_7", StringType(), required=True),
NestedField(8, "col_8", StringType(), required=True),
NestedField(9, "col_9", StringType(), required=True),
NestedField(10, "col_10", StringType(), required=True),
identifier_field_ids=[8], # 'col_8' is the primary key
)
# Table location
table_location = "s3://<my-bucket>/icebergCorrupted/"
# Create the table
iceberg_table = glue_catalog.create_table(
identifier=("iceberg_database", "icebergCorrupted"),
schema=schema,
partition_spec=PartitionSpec(
PartitionField(
source_id=9,
field_id=1000,
transform=IdentityTransform(),
name="col_9",
),
),
sort_order=SortOrder(SortField(source_id=10,
transform=IdentityTransform())),
location=table_location,
properties={
"write.format.default": "parquet",
"write.target-file-size-bytes": 134217728, # 128 MB
"write.metadata.delete-after-commit.enabled": True,
"write.metadata.previous-versions-max": 5
},
)
print(f"Iceberg table 'icebergCorrupted' created successfully in database
'iceberg_database' at: {table_location}")
```
Following the table creation, I performed schema modifications by adding
four additional columns and removing the sort column through the Glue Catalog
console.
Subsequently, I executed the following code to verify the schema changes:
```python
# Load an existing table
table = glue_catalog.load_table(("iceberg_database", "icebergcorrupted"))
# Get table metadata
print(f"Table name: {table.name()}")
print(f"Table schema: {table.schema()}")
```
Root Cause Identified:
The schema retrieved through PyIceberg did not reflect the changes made via
the Glue Catalog console. This behavior is consistent with a known issue where
schema changes made through the AWS Glue Data Catalog UI or PyIceberg library
do not properly update the Iceberg table metadata files in S3, causing
inconsistencies when querying through Athena.
You can verify this issue by making schema changes through the Glue Data
Catalog console and observing that no new manifest metadata files are generated
in the S3 metadata directory.
Recommended Solution:
To resolve this issue and prevent future table corruption, all schema
changes must be performed using Athena ALTER TABLE statements [1]. This
approach ensures proper metadata updates and maintains consistency across all
query engines.
Example of proper schema modification using Athena:
```sql
ALTER TABLE your_database.your_table_name
ADD COLUMN new_column_name string;
```
This workaround will ensure that your Iceberg table metadata remains
consistent and prevents the corruption issues you have experienced.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]