MrDerecho opened a new issue, #2944:
URL: https://github.com/apache/iceberg-python/issues/2944

   ### Apache Iceberg version
   
   0.10.0 (latest release)
   
   ### Please describe the bug 🐞
   
   ## Summary
   
   `expire_snapshots()` fails on tables with statistics metadata when using 
PyIceberg 0.10.0 with Pydantic 2.x due to incorrect positional argument usage 
in `RemoveStatisticsUpdate` initialization.
   
   ## Environment
   
   - **PyIceberg Version:** 0.10.0
   - **Pydantic Version:** 2.12.4
   - **Python Version:** 3.13.3
   - **Platform:** macOS
   - **Catalog Type:** AWS Glue
   
   ## Description
   
   When calling `expire_snapshots()` on Iceberg tables that contain statistics 
metadata, the operation fails with a Pydantic validation error. The error only 
occurs on tables with statistics files in their metadata (e.g., tables that 
have undergone Trino compaction operations with statistics collection enabled).
   
   ## Error
   
   ```python
   Traceback (most recent call last):
     File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1208, in commit
       return self._apply(self._transaction._table_metadata)
     File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1182, in _apply
       updated_metadata = update_table_metadata(
     File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 195, in update_table_metadata
       new_metadata = _apply_table_update(update, base_metadata, context)
     File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 490, in _apply_table_update
       for upd in updates:
     File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 505, in <genexpr>
       RemoveStatisticsUpdate(statistics_file.snapshot_id)
   TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given
   ```
   
   ## Steps to Reproduce
   
   1. Create an Iceberg table with statistics metadata (e.g., via Trino 
compaction with statistics collection)
   2. Verify table has statistics files:
      ```python
      from pyiceberg.catalog import load_catalog
      
      catalog = load_catalog('glue', **{'type': 'glue', 'region_name': 'us-east-1'})
      table = catalog.load_table('database.table_name')
      print(f"Statistics files: {len(table.metadata.statistics)}")  # Should be > 0
      ```
   3. Attempt to expire snapshots:
      ```python
      table.manage_snapshots().expire_snapshots().retain_last(3).commit()
      ```
   4. Observe TypeError
   
   ## Root Cause
   
   In `pyiceberg/table/update/__init__.py` at line 505, 
`RemoveStatisticsUpdate` is instantiated with a positional argument:
   
   ```python
   remove_statistics_updates = (
       RemoveStatisticsUpdate(statistics_file.snapshot_id)  # ❌ Positional argument
       for statistics_file in base_metadata.statistics
       if statistics_file.snapshot_id in update.snapshot_ids
   )
   ```
   
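   However, `RemoveStatisticsUpdate` is a Pydantic `BaseModel`, and
`BaseModel.__init__()` accepts field values only as keyword arguments, so the
positional call raises the `TypeError` shown above. (Pydantic 1.x models have
the same keyword-only constructor, so this is not specific to the 2.x upgrade.)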
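
The failure mode can be sketched without installing Pydantic. The class below is a hypothetical stand-in, not Pydantic's implementation: it only copies the constructor shape `__init__(self, **data)`, which is enough to show why the positional call is rejected while the keyword call works.

```python
class KeywordOnlyModel:
    """Hypothetical stand-in for a Pydantic model; fields arrive only via **data."""

    def __init__(self, **data):
        # No positional slots besides `self`, mirroring BaseModel.__init__(self, **data).
        self.snapshot_id = int(data["snapshot_id"])


def try_positional(snapshot_id):
    """Return the TypeError message raised by a positional call, or None."""
    try:
        KeywordOnlyModel(snapshot_id)  # same call shape as the failing line
    except TypeError as exc:
        return str(exc)
    return None


positional_error = try_positional(42)
keyword_model = KeywordOnlyModel(snapshot_id=42)  # keyword form succeeds

print(positional_error)
print(keyword_model.snapshot_id)
```

The positional call never reaches field validation; Python rejects it while binding arguments, which is why the traceback reports a plain `TypeError` rather than a Pydantic `ValidationError`.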
   
   ## Expected Behavior
   
   `expire_snapshots()` should successfully remove expired snapshots and their 
associated statistics metadata without errors.
   
   ## Actual Behavior
   
   Operation fails with `TypeError: BaseModel.__init__() takes 1 positional 
argument but 2 were given`.
   
   ## Proposed Fix
   
   Change the positional argument to a keyword argument:
   
   ```python
   remove_statistics_updates = (
       RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)  # ✓ Keyword argument
       for statistics_file in base_metadata.statistics
       if statistics_file.snapshot_id in update.snapshot_ids
   )
   ```
   
   **File:** `pyiceberg/table/update/__init__.py`  
   **Line:** 505
   
   ## Impact
   
   This bug affects any table that has statistics metadata, which occurs when:
   - Trino performs compaction operations with statistics collection enabled
   - Statistics are explicitly written to table metadata
   - Tables are managed with tools that generate statistics files
   
   In environments with automated compaction jobs, this prevents snapshot 
expiration from functioning, leading to unbounded metadata growth.
   
   ## Workaround
   
   Manually patch the installed package:
   
   ```bash
   # Find the installation path
   python -c "import pyiceberg.table.update; import os; print(os.path.dirname(pyiceberg.table.update.__file__))"
   
   # Apply the fix (adjust path accordingly).
   # On macOS/BSD sed, use `sed -i ''` instead of `sed -i`.
   sed -i 's/RemoveStatisticsUpdate(statistics_file.snapshot_id)/RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)/' \
     <path-to-site-packages>/pyiceberg/table/update/__init__.py
   ```
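
Because `sed -i` takes different arguments on GNU and BSD/macOS, the same substitution could also be done from Python. This is a minimal sketch under the assumption that the installed file contains the call exactly as quoted in this issue; `patch_text` and `patch_file` are hypothetical helper names, and the path in the trailing comment is the same placeholder used in the `sed` command.

```python
from pathlib import Path

OLD_CALL = "RemoveStatisticsUpdate(statistics_file.snapshot_id)"
NEW_CALL = "RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)"


def patch_text(source: str) -> str:
    """Replace the positional call with the keyword call; no-op if absent."""
    return source.replace(OLD_CALL, NEW_CALL)


def patch_file(path: Path) -> bool:
    """Patch the file in place and return True if its contents changed."""
    original = path.read_text(encoding="utf-8")
    patched = patch_text(original)
    if patched == original:
        return False
    path.write_text(patched, encoding="utf-8")
    return True


# Example call (placeholder path, adjust to your environment):
# patch_file(Path("<path-to-site-packages>/pyiceberg/table/update/__init__.py"))
```

Running it a second time leaves the file untouched, since the already-patched call no longer matches `OLD_CALL`.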
   
   ## Additional Context
   
   ### Why This Bug May Go Unnoticed
   
   Most Iceberg tables do not have statistics metadata:
   - Standard `APPEND` operations don't create statistics
   - Only specific operations (like Trino compaction) generate statistics
   - The bug only triggers when tables have statistics AND snapshots are expired
   
   In our environment testing 10 tables, only 1 had statistics metadata (from a 
dedicated Trino compaction job), making this a rare but critical failure mode.
   
   ### Pydantic 2.x Migration
   
   This appears to be a call site that was missed: most of PyIceberg
instantiates its Pydantic models with keyword arguments, but this specific
instance still passes the snapshot ID positionally.
   
   Similar issues may exist elsewhere in the codebase where Pydantic models are 
instantiated with positional arguments.
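
One way to look for similar call sites is a small `ast` scan. The sketch below is an illustration, not an existing PyIceberg tool: `find_positional_calls` is a hypothetical helper, and the caller has to supply the set of class names to check, since the scan cannot tell on its own which classes are Pydantic models or which define a custom `__init__` that does accept positional arguments.

```python
import ast


def find_positional_calls(source: str, class_names: set[str]) -> list[tuple[int, str]]:
    """Return (line, name) for each call to a listed class that passes positional args."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue  # not a call, or keyword-only call
        func = node.func
        # Handle both `Name(...)` and `module.Name(...)` call forms.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in class_names:
            hits.append((node.lineno, name))
    return sorted(hits)


sample = (
    "a = RemoveStatisticsUpdate(statistics_file.snapshot_id)\n"
    "b = RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)\n"
)
hits = find_positional_calls(sample, {"RemoveStatisticsUpdate"})
print(hits)
```

On the two-line sample, only the first line is reported; each hit would still need manual review before changing it.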
   
   ## Related Code
   
   `RemoveStatisticsUpdate` class definition (appears to be correctly defined 
as a Pydantic model):
   
   ```python
   class RemoveStatisticsUpdate(TableUpdate):
       snapshot_id: int
   ```
   
   The issue is purely in the instantiation at line 505, not the class 
definition.
   
   ## Verification
   
   After applying the fix:
   - All 10 test tables successfully expire snapshots
   - Table with statistics metadata (1096 snapshots) correctly reduced to 2 
snapshots with `retain_last(2)`
   - Statistics metadata properly cleaned up alongside expired snapshots
   - All tables remain readable after expiration
   
   ---
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time

