MrDerecho opened a new issue, #2944:
URL: https://github.com/apache/iceberg-python/issues/2944
### Apache Iceberg version
0.10.0 (latest release)
### Please describe the bug 🐞
## Summary
`expire_snapshots()` fails on tables with statistics metadata when using
PyIceberg 0.10.0 with Pydantic 2.x due to incorrect positional argument usage
in `RemoveStatisticsUpdate` initialization.
## Environment
- **PyIceberg Version:** 0.10.0
- **Pydantic Version:** 2.12.4
- **Python Version:** 3.13.3
- **Platform:** macOS
- **Catalog Type:** AWS Glue
## Description
When calling `expire_snapshots()` on Iceberg tables that contain statistics
metadata, the operation fails with a Pydantic validation error. The error only
occurs on tables with statistics files in their metadata (e.g., tables that
have undergone Trino compaction operations with statistics collection enabled).
## Error
```python
Traceback (most recent call last):
  File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1208, in commit
    return self._apply(self._transaction._table_metadata)
  File "/lib/python3.13/site-packages/pyiceberg/table/__init__.py", line 1182, in _apply
    updated_metadata = update_table_metadata(
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 195, in update_table_metadata
    new_metadata = _apply_table_update(update, base_metadata, context)
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 490, in _apply_table_update
    for upd in updates:
  File "/lib/python3.13/site-packages/pyiceberg/table/update/__init__.py", line 505, in <genexpr>
    RemoveStatisticsUpdate(statistics_file.snapshot_id)
TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given
```
## Steps to Reproduce
1. Create an Iceberg table with statistics metadata (e.g., via Trino
compaction with statistics collection)
2. Verify table has statistics files:
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog('glue', **{'type': 'glue', 'region_name': 'us-east-1'})
table = catalog.load_table('database.table_name')
print(f"Statistics files: {len(table.metadata.statistics)}")  # Should be > 0
```
3. Attempt to expire snapshots:
```python
table.manage_snapshots().expire_snapshots().retain_last(3).commit()
```
4. Observe TypeError
## Root Cause
In `pyiceberg/table/update/__init__.py` at line 505,
`RemoveStatisticsUpdate` is instantiated with a positional argument:
```python
remove_statistics_updates = (
    RemoveStatisticsUpdate(statistics_file.snapshot_id)  # ❌ Positional argument
    for statistics_file in base_metadata.statistics
    if statistics_file.snapshot_id in update.snapshot_ids
)
```
However, `RemoveStatisticsUpdate` is a Pydantic 2.x `BaseModel`, whose
`__init__()` accepts field values only as keyword arguments. Passing
`snapshot_id` positionally therefore raises the `TypeError` shown in the
Error section.
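The mechanism can be illustrated without PyIceberg or Pydantic installed. The sketch below uses hypothetical stand-in classes (`KeywordOnlyModel` and `RemoveStatisticsUpdateStandIn` are not real library names) whose `__init__` has the same shape as the one in the traceback: `self` plus keyword data only. It is an illustration under that assumption, not the actual PyIceberg code.

```python
class KeywordOnlyModel:
    """Hypothetical stand-in: accepts field values as keyword arguments only."""

    def __init__(self, /, **data):
        # Store whatever fields were passed by name.
        for name, value in data.items():
            setattr(self, name, value)


class RemoveStatisticsUpdateStandIn(KeywordOnlyModel):
    """Stand-in for RemoveStatisticsUpdate; not the PyIceberg class."""


def try_positional(snapshot_id):
    """Return the TypeError message raised by a positional call, or None."""
    try:
        RemoveStatisticsUpdateStandIn(snapshot_id)
    except TypeError as exc:
        return str(exc)
    return None


# Positional call: rejected before any field handling happens.
message = try_positional(123)
print(message)

# Keyword call: accepted.
update = RemoveStatisticsUpdateStandIn(snapshot_id=123)
print(update.snapshot_id)
```

The positional call fails while Python binds arguments to `__init__`, so the error appears regardless of how the model's fields are declared.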
## Expected Behavior
`expire_snapshots()` should successfully remove expired snapshots and their
associated statistics metadata without errors.
## Actual Behavior
Operation fails with `TypeError: BaseModel.__init__() takes 1 positional
argument but 2 were given`.
## Proposed Fix
Change the positional argument to a keyword argument:
```python
remove_statistics_updates = (
    RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)  # ✓ Keyword argument
    for statistics_file in base_metadata.statistics
    if statistics_file.snapshot_id in update.snapshot_ids
)
```
**File:** `pyiceberg/table/update/__init__.py`
**Line:** 505
## Impact
This bug affects any table that has statistics metadata, which occurs when:
- Trino performs compaction operations with statistics collection enabled
- Statistics are explicitly written to table metadata
- Tables are managed with tools that generate statistics files
In environments with automated compaction jobs, this prevents snapshot
expiration from functioning, leading to unbounded metadata growth.
## Workaround
Manually patch the installed package:
```bash
# Find the installation path
python -c "import pyiceberg.table.update; import os; print(os.path.dirname(pyiceberg.table.update.__file__))"
# Apply the fix (adjust path accordingly; BSD/macOS sed needs `sed -i ''`)
sed -i \
  's/RemoveStatisticsUpdate(statistics_file.snapshot_id)/RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)/' \
  <path-to-site-packages>/pyiceberg/table/update/__init__.py
```
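Because `sed -i` behaves differently on GNU and BSD/macOS, the same textual patch can also be applied from Python. This is a minimal sketch, not an official tool: `patch_file` is a hypothetical helper name, and it assumes the installed file contains the call exactly as quoted in this report.

```python
from pathlib import Path

OLD_CALL = "RemoveStatisticsUpdate(statistics_file.snapshot_id)"
NEW_CALL = "RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)"


def patch_file(path: Path) -> int:
    """Replace the positional call in `path`; return the number of replacements."""
    source = path.read_text(encoding="utf-8")
    count = source.count(OLD_CALL)
    if count:
        # Only rewrite the file when there is something to change.
        path.write_text(source.replace(OLD_CALL, NEW_CALL), encoding="utf-8")
    return count
```

It could be pointed at the installed module with `patch_file(Path(pyiceberg.table.update.__file__))` after `import pyiceberg.table.update`. Running it a second time returns 0, since the patched call no longer matches. As with the `sed` approach, the change is lost when the package is reinstalled.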
## Additional Context
### Why This Bug May Go Unnoticed
Most Iceberg tables do not have statistics metadata:
- Standard `APPEND` operations don't create statistics
- Only specific operations (like Trino compaction) generate statistics
- The bug only triggers when tables have statistics AND snapshots are expired
In our environment testing 10 tables, only 1 had statistics metadata (from a
dedicated Trino compaction job), making this a rare but critical failure mode.
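To find out which tables are exposed, a namespace can be scanned for tables whose metadata lists statistics files. The helper below is a sketch: `tables_with_statistics` is a hypothetical name, and it assumes only that the catalog object offers `list_tables(namespace)` and `load_table(identifier)` and that loaded tables expose `metadata.statistics`, as in the reproduction steps.

```python
def tables_with_statistics(catalog, namespace):
    """Return identifiers of tables in `namespace` that have statistics files."""
    affected = []
    for identifier in catalog.list_tables(namespace):
        table = catalog.load_table(identifier)
        # A non-empty statistics list is the condition that triggers the bug
        # once snapshots referenced by those statistics are expired.
        if len(table.metadata.statistics) > 0:
            affected.append(identifier)
    return affected


# Usage sketch (requires PyIceberg and catalog access):
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog('glue', **{'type': 'glue', 'region_name': 'us-east-1'})
# print(tables_with_statistics(catalog, 'database'))
```

Each table is loaded once, so on large namespaces this costs one metadata read per table.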
### Pydantic 2.x Migration
This appears to be an incomplete migration to Pydantic 2.x. While most of
PyIceberg correctly uses keyword arguments with Pydantic models, this specific
instance was missed.
Similar issues may exist elsewhere in the codebase where Pydantic models are
instantiated with positional arguments.
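One way to look for other call sites of this kind is a small static scan with the standard-library `ast` module. The sketch below is only a starting point: `find_positional_calls` is a hypothetical helper, and the caller must supply the class names to check, since the scan cannot tell by itself which classes are Pydantic models.

```python
import ast


def find_positional_calls(source: str, class_names: set[str]) -> list[tuple[int, str]]:
    """Return (line, name) for each call to a listed class that passes positional args."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Handle both `Name(...)` and `module.Name(...)` call forms.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in class_names:
            hits.append((node.lineno, name))
    return sorted(hits)


sample = (
    "a = RemoveStatisticsUpdate(statistics_file.snapshot_id)\n"
    "b = RemoveStatisticsUpdate(snapshot_id=statistics_file.snapshot_id)\n"
)
hits = find_positional_calls(sample, {"RemoveStatisticsUpdate"})
print(hits)  # [(1, 'RemoveStatisticsUpdate')]
```

Only the first line of the sample is reported, because the second call passes its value by keyword. Calls that unpack `*args` are also flagged, which may produce false positives worth reviewing by hand.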
## Related Code
`RemoveStatisticsUpdate` class definition (appears to be correctly defined
as a Pydantic model):
```python
class RemoveStatisticsUpdate(TableUpdate):
    snapshot_id: int
```
The issue is purely in the instantiation at line 505, not the class
definition.
## Verification
After applying the fix:
- All 10 test tables successfully expire snapshots
- Table with statistics metadata (1096 snapshots) correctly reduced to 2
snapshots with `retain_last(2)`
- Statistics metadata properly cleaned up alongside expired snapshots
- All tables remain readable after expiration
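The first three checks above can be expressed as a small helper that inspects table metadata after expiry. This is a sketch under assumptions: `check_expiry` is a hypothetical name, and it assumes the metadata object exposes `snapshots` and `statistics` sequences whose entries each carry a `snapshot_id` attribute.

```python
def check_expiry(metadata, expected_snapshots):
    """Return a list of problems found in metadata after expiry (empty if none)."""
    problems = []
    remaining_ids = {snapshot.snapshot_id for snapshot in metadata.snapshots}
    if len(remaining_ids) != expected_snapshots:
        problems.append(
            f"expected {expected_snapshots} snapshots, found {len(remaining_ids)}"
        )
    for statistics_file in metadata.statistics:
        # Statistics for an expired snapshot should have been removed with it.
        if statistics_file.snapshot_id not in remaining_ids:
            problems.append(
                f"statistics still reference expired snapshot {statistics_file.snapshot_id}"
            )
    return problems


# Usage sketch after `retain_last(2)` has been committed and the table reloaded:
# assert check_expiry(table.metadata, expected_snapshots=2) == []
```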
---
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [x] I cannot contribute a fix for this bug at this time