Re: [I] Adding a column to an existing table corrupts it in the glue catalog table [iceberg-python]

via GitHub Wed, 02 Jul 2025 01:47:39 -0700


mwa28 commented on issue #2166:
URL: 
https://github.com/apache/iceberg-python/issues/2166#issuecomment-3026948275


   Interesting, here is AWS support's take on the matter:
   
   Using my support tools, I observed that you executed a query running "SELECT 
* FROM "iceberg_database"."icebergCorrupted" limit 10", followed by query 
executing "OPTIMIZE iceberg_database.icebergCorrupted REWRITE DATA USING 
BIN_PACK WHERE col_9 = '2025-06-30'". After the optimization operation, Athena 
detected the table as corrupted.
   
   To investigate this issue further, I replicated your table structure using 
the manifest file you provided. I created a test table using the following 
PyIceberg script:
   
   ```python
   
   from pyiceberg.catalog import load_catalog
   from pyiceberg.schema import Schema, NestedField
   from pyiceberg.types import StringType
   from pyiceberg.partitioning import PartitionSpec, PartitionField
   from pyiceberg.transforms import IdentityTransform
   from pyiceberg.table.sorting import SortOrder, SortField
   
   glue_catalog = load_catalog(
       'iceberg_database',
       **{
           'client.access-key-id': '[REDACTED]',
           'client.secret-access-key': '[REDACTED]',
           'client.region': 'ca-central-1'
       },
       type='glue'
   )
   
   # Define your schema with all required fields
   schema = Schema(
       NestedField(1, "col_1", StringType(), required=True),
       NestedField(2, "col_2", StringType(), required=True),
       NestedField(3, "col_3", StringType(), required=True),
       NestedField(4, "col_4", StringType(), required=True),
       NestedField(5, "col_5", StringType(), required=True),
       NestedField(6, "col_6", StringType(), required=True),
       NestedField(7, "col_7", StringType(), required=True),
       NestedField(8, "col_8", StringType(), required=True),
       NestedField(9, "col_9", StringType(), required=True),
       NestedField(10, "col_10", StringType(), required=True),
       identifier_field_ids=[8],  # 'col_8' is the primary key
   )
   
   # Table location
   table_location = "s3://<my-bucket>/icebergCorrupted/"
   
   # Create the table
   iceberg_table = glue_catalog.create_table(
       identifier=("iceberg_database", "icebergCorrupted"),
       schema=schema,
       partition_spec=PartitionSpec(
           PartitionField(
               source_id=9,
               field_id=1000,
               transform=IdentityTransform(),
               name="col_9",
           ),
       ),
       sort_order=SortOrder(SortField(source_id=10, 
transform=IdentityTransform())),
       location=table_location,
       properties={
           "write.format.default": "parquet",
           "write.target-file-size-bytes": 134217728,  # 128 MB
           "write.metadata.delete-after-commit.enabled": True,
           "write.metadata.previous-versions-max": 5
       },
   )
   
   print(f"Iceberg table 'icebergCorrupted' created successfully in database 
'iceberg_database' at: {table_location}")
   
   ```
   
   Following the table creation, I performed schema modifications by adding 
four additional columns and removing the sort column through the Glue Catalog 
console.
   
   Subsequently, I executed the following code to verify the schema changes:
   
   ```python
   
   # Load an existing table
   table = glue_catalog.load_table(("iceberg_database", "icebergcorrupted"))
   
   # Get table metadata
   print(f"Table name: {table.name()}")
   print(f"Table schema: {table.schema()}")
   
   ```
   
   Root Cause Identified:
   
   The schema retrieved through PyIceberg did not reflect the changes made via 
the Glue Catalog console. This behavior is consistent with a known issue where 
schema changes made through the AWS Glue Data Catalog UI or PyIceberg library 
do not properly update the Iceberg table metadata files in S3, causing 
inconsistencies when querying through Athena.
   
   You can verify this issue by making schema changes through the Glue Data 
Catalog console and observing that no new manifest metadata files are generated 
in the S3 metadata directory.
   
   Recommended Solution:
   
   To resolve this issue and prevent future table corruption, all schema 
changes must be performed using Athena ALTER TABLE statements [1]. This 
approach ensures proper metadata updates and maintains consistency across all 
query engines.
   
   Example of proper schema modification using Athena:
   
   ```sql
   
   ALTER TABLE your_database.your_table_name 
   ADD COLUMN new_column_name string;
   
   ```
   
   This workaround will ensure that your Iceberg table metadata remains 
consistent and prevents the corruption issues you have experienced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Adding a column to an existing table corrupts it in the glue catalog table [iceberg-python]

Reply via email to