EmilLindfors opened a new pull request, #1882:
URL: https://github.com/apache/iceberg-rust/pull/1882
## Overview
This PR adds deletion support for iceberg-rust, including position deletes,
equality deletes, deletion vectors (Puffin format), and `RowDeltaAction` for
atomic row-level changes.
## Related Issues
- Closes #1104 - Support RowDeltaAction
- Addresses #340 - Position delete writer improvements
- Addresses #1548 - Snapshot property tracking for deletion operations
## What's New
### 1. RowDeltaAction Implementation
Implementation of `RowDeltaAction` for atomic row-level changes with
serializable isolation guarantees.
**Features:**
- Add/remove data files and delete files in a single atomic transaction
- Conflict detection for concurrent operations
- Validation modes:
  - `validate_data_files_exist()` - Ensures referenced files haven't been
    removed
  - `validate_deleted_files()` - Ensures files being removed haven't been
    concurrently deleted
  - `validate_no_concurrent_data_files()` - Detects concurrent data file
    additions
  - `validate_no_concurrent_delete_files()` - Detects concurrent delete file
    additions
  - `validate_from_snapshot()` - Sets base snapshot for conflict detection
- Conflict detection filters to scope validation by partition/predicate
- Snapshot property tracking
- 18 unit tests covering core functionality
**Use Cases:**
- UPDATE operations (add deletes for old values, optionally add new data
files)
- DELETE operations (add position or equality delete files)
- MERGE operations (combination of inserts, updates, deletes)
- Compaction (remove old files, add compacted files, preserve delete files)
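For reviewers who want a mental model of the validation modes listed above, here is a minimal, self-contained sketch of conflict detection against a base snapshot. It is not the code in `row_delta.rs`: `SnapshotChange`, `Conflict`, and `validate` are hypothetical names, and the snapshot history is reduced to plain lists of file paths.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified record of what one committed snapshot changed.
struct SnapshotChange {
    snapshot_id: i64,
    added_data_files: Vec<String>,
    removed_data_files: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum Conflict {
    /// A file our deletes point at was removed by a concurrent commit.
    ReferencedFileRemoved(String),
    /// A concurrent commit added a data file we did not see.
    ConcurrentDataFile(String),
}

/// Inspects every snapshot committed after `base_snapshot_id`.
/// `history` is assumed to be ordered oldest to newest.
fn validate(
    history: &[SnapshotChange],
    base_snapshot_id: i64,
    referenced_files: &HashSet<String>,
    forbid_concurrent_data: bool,
) -> Result<(), Conflict> {
    // Start just past the base snapshot; if it is unknown, check everything.
    let start = history
        .iter()
        .position(|s| s.snapshot_id == base_snapshot_id)
        .map_or(0, |i| i + 1);

    for change in &history[start..] {
        // Roughly the idea behind `validate_data_files_exist()`.
        for removed in &change.removed_data_files {
            if referenced_files.contains(removed) {
                return Err(Conflict::ReferencedFileRemoved(removed.clone()));
            }
        }
        // Roughly the idea behind `validate_no_concurrent_data_files()`.
        if forbid_concurrent_data {
            if let Some(added) = change.added_data_files.first() {
                return Err(Conflict::ConcurrentDataFile(added.clone()));
            }
        }
    }
    Ok(())
}
```

In this sketch, a commit that started from snapshot 1 and references `a.parquet` fails validation if snapshot 2 removed that file. The real action additionally scopes the check with the conflict detection filter, which is left out here.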
### 2. Deletion Vector Integration
**Puffin Deletion Vector Support:**
- Consolidated deletion vector implementation in
`crates/iceberg/src/puffin/deletion_vector.rs`
- `DeletionVectorWriter` for creating Puffin files with deletion vectors
- Roaring64 bitmap encoding for efficient 64-bit position storage
- CRC-32 checksum validation
- Magic byte verification
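To make the checksum and magic-byte checks concrete, the sketch below frames and verifies a deletion vector blob. The layout (big-endian length of magic plus bitmap, the magic bytes `D1 D3 39 64`, the serialized bitmap, then a big-endian CRC-32 over magic plus bitmap) follows my reading of the Puffin `deletion-vector-v1` blob description and should be checked against the spec. The bitmap is treated as opaque bytes, `frame_blob` and `verify_blob` are hypothetical names rather than the functions in `deletion_vector.rs`, and the CRC-32 is hand-rolled only to keep the example dependency-free.

```rust
/// Magic sequence assumed for `deletion-vector-v1` blobs.
const DV_MAGIC: [u8; 4] = [0xD1, 0xD3, 0x39, 0x64];

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Wraps an already-serialized bitmap: length, magic, bitmap, checksum.
fn frame_blob(bitmap: &[u8]) -> Vec<u8> {
    let mut body = Vec::with_capacity(4 + bitmap.len());
    body.extend_from_slice(&DV_MAGIC);
    body.extend_from_slice(bitmap);

    let mut out = Vec::with_capacity(body.len() + 8);
    out.extend_from_slice(&(body.len() as u32).to_be_bytes());
    out.extend_from_slice(&body);
    out.extend_from_slice(&crc32(&body).to_be_bytes());
    out
}

/// Checks length, magic bytes, and checksum; returns the bitmap bytes.
fn verify_blob(blob: &[u8]) -> Result<&[u8], String> {
    if blob.len() < 12 {
        return Err("blob too short".to_string());
    }
    let declared = u32::from_be_bytes([blob[0], blob[1], blob[2], blob[3]]) as usize;
    if declared < 4 || blob.len() != declared + 8 {
        return Err("length prefix does not match blob size".to_string());
    }
    let body = &blob[4..4 + declared]; // magic + bitmap
    if body[..4] != DV_MAGIC[..] {
        return Err("bad magic bytes".to_string());
    }
    let tail = &blob[4 + declared..];
    let stored = u32::from_be_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if stored != crc32(body) {
        return Err("CRC-32 mismatch".to_string());
    }
    Ok(&body[4..])
}
```

A round trip through `frame_blob` and `verify_blob` returns the original bitmap bytes, while flipping any byte of the body makes verification fail with a checksum or magic error.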
**Delete File Index Integration:**
- O(1) lookup for deletion vectors by referenced data file path
- Proper handling of deletion vector metadata (`referenced_data_file`,
`content_offset`, `content_size_in_bytes`)
- Sequence number filtering for deletion vectors
- Clear separation between:
  - Global equality deletes (unpartitioned equality deletes apply to all
    partitions)
  - Partition-scoped position deletes (including unpartitioned ones)
  - Deletion vectors (Puffin-based position deletes with direct file
    references)
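The sequence number filtering mentioned above comes down to a small comparison. The sketch below encodes the rule as I understand it from the Iceberg spec: position deletes and deletion vectors apply to data files with an equal or lower data sequence number, while equality deletes apply only to strictly older data files. `DeleteKind` and `applies_by_sequence` are illustrative names, not identifiers from this PR.

```rust
/// Hypothetical classification of a delete file for this example.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DeleteKind {
    Position,
    Equality,
    DeletionVector,
}

/// Returns true when a delete with `delete_seq` may apply to a data file
/// with `data_seq`, looking at sequence numbers only. Partition and
/// file-path matching are separate checks and are not shown here.
fn applies_by_sequence(kind: DeleteKind, data_seq: i64, delete_seq: i64) -> bool {
    match kind {
        // Positional deletes can target rows written in the same commit.
        DeleteKind::Position | DeleteKind::DeletionVector => data_seq <= delete_seq,
        // Equality deletes must not remove rows added in the same commit.
        DeleteKind::Equality => data_seq < delete_seq,
    }
}
```

The strict inequality for equality deletes is what lets a single commit write an equality delete and the replacement rows together without the delete also removing the new rows.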
### 3. Enhanced Transaction Support
**Snapshot Property Tracking:**
- Tracking of deletion-related statistics in `UpdateMetrics`
- Counters for:
  - `deleted-records`, `deleted-data-files`
  - `added-delete-files`, `removed-delete-files`
  - Position and equality delete file counts (added/removed)
  - Position and equality delete record counts (added/removed)
- Integration with all transaction types (FastAppend, Append, Delete,
RowDelta)
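As a rough picture of how such counters could end up in a snapshot summary, consider the sketch below. `DeleteMetrics` is a hypothetical stand-in and not the `UpdateMetrics` type from this PR. The first four property names are the ones listed above; `added-position-deletes` and `added-equality-deletes` are assumed from the Iceberg snapshot summary conventions.

```rust
use std::collections::BTreeMap;

/// Hypothetical counter struct for this example only.
#[derive(Debug, Default)]
struct DeleteMetrics {
    deleted_records: u64,
    deleted_data_files: u64,
    added_delete_files: u64,
    removed_delete_files: u64,
    added_position_deletes: u64,
    added_equality_deletes: u64,
}

impl DeleteMetrics {
    /// Records one newly added delete file and the rows it deletes.
    fn record_added_delete_file(&mut self, is_equality: bool, record_count: u64) {
        self.added_delete_files += 1;
        if is_equality {
            self.added_equality_deletes += record_count;
        } else {
            self.added_position_deletes += record_count;
        }
    }

    /// Renders non-zero counters as string properties.
    fn to_summary(&self) -> BTreeMap<String, String> {
        let counters = [
            ("deleted-records", self.deleted_records),
            ("deleted-data-files", self.deleted_data_files),
            ("added-delete-files", self.added_delete_files),
            ("removed-delete-files", self.removed_delete_files),
            ("added-position-deletes", self.added_position_deletes),
            ("added-equality-deletes", self.added_equality_deletes),
        ];
        counters
            .iter()
            .filter(|(_, value)| *value > 0)
            .map(|(key, value)| (key.to_string(), value.to_string()))
            .collect()
    }
}
```

Omitting zero-valued counters is a design choice of this sketch. Whether the PR writes zeros or leaves them out is not stated in the description.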
**Transaction Improvements:**
- Enhanced `AppendDeleteFilesAction` for committing delete files
- Proper manifest generation for delete files
- Support for both position and equality delete files
### 4. Integration Tests
**New Test Coverage:**
1. **`test_position_deletes_with_append_action`** - End-to-end position
delete workflow
2. **`test_equality_deletes_with_append_action`** - End-to-end equality
delete workflow
3. **`test_multiple_delete_files`** - Multiple delete files in single
transaction
4. **`test_deletion_vectors_with_puffin`** - Puffin deletion vector
write/commit/scan cycle
5. **`test_row_delta_add_delete_files`** - RowDeltaAction integration testing
**Test Infrastructure Improvements:**
- Added `ContainerRuntime` enum for Docker/Podman abstraction
- Podman support with localhost networking (WSL2 compatible)
- Docker-compose improvements:
  - Healthchecks for REST catalog service
  - Exposed MinIO port 9000
  - Fully qualified image paths for Podman compatibility
All integration tests pass with both Docker and Podman.
## Implementation Details
### Delete File Index
Enhanced indexing logic for different delete file types:
```rust
// Equality deletes with empty partition → global (apply to all partitions)
// Position deletes with empty partition → partition-scoped (only apply to
// unpartitioned files)
// Deletion vectors → indexed by referenced data file path for O(1) lookup
```
**Improvements:**
- Detection of deletion vectors based on `referenced_data_file` +
`content_offset` + `content_size_in_bytes`
- HashMap-based indexing for deletion vectors
- Correct application of spec rules for unpartitioned delete files
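The routing rules above can be sketched with a toy index. Everything here is simplified and hypothetical: `DeleteFile`, `Content`, and `DeleteFileIndex` are stand-ins for the real types, and a partition is modeled as `Option<String>` with `None` meaning unpartitioned.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Content {
    PositionDeletes,
    EqualityDeletes,
}

/// Hypothetical, trimmed-down delete file entry.
#[derive(Debug, Clone)]
struct DeleteFile {
    path: String,
    content: Content,
    partition: Option<String>, // None = unpartitioned
    referenced_data_file: Option<String>,
    content_offset: Option<i64>,
    content_size_in_bytes: Option<i64>,
}

impl DeleteFile {
    /// A deletion vector carries all three Puffin-related fields.
    fn is_deletion_vector(&self) -> bool {
        self.referenced_data_file.is_some()
            && self.content_offset.is_some()
            && self.content_size_in_bytes.is_some()
    }
}

#[derive(Default)]
struct DeleteFileIndex {
    global_equality: Vec<DeleteFile>,
    by_partition: HashMap<Option<String>, Vec<DeleteFile>>,
    deletion_vectors: HashMap<String, DeleteFile>,
}

impl DeleteFileIndex {
    fn insert(&mut self, file: DeleteFile) {
        if file.is_deletion_vector() {
            // Keyed by the data file it targets, so lookup is a hash probe.
            if let Some(target) = file.referenced_data_file.clone() {
                self.deletion_vectors.insert(target, file);
            }
        } else if file.content == Content::EqualityDeletes && file.partition.is_none() {
            // Unpartitioned equality deletes apply to every partition.
            self.global_equality.push(file);
        } else {
            // Everything else stays scoped to its own partition,
            // including unpartitioned position deletes (key `None`).
            self.by_partition
                .entry(file.partition.clone())
                .or_default()
                .push(file);
        }
    }

    fn deletion_vector_for(&self, data_file_path: &str) -> Option<&DeleteFile> {
        self.deletion_vectors.get(data_file_path)
    }
}
```

Keeping deletion vectors in their own map means a scan task for a given data file needs one hash lookup rather than a walk over all position delete files in the partition.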
### Position Delete Writer
Enhancements:
- Handling of `referenced_data_file` optimization
- Automatic sorting by (file_path, pos)
- Batch tracking for multi-file optimization
- Spec-compliant field IDs (2147483546 for file_path, 2147483545 for pos)
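A small sketch of the ordering and the single-file check described above, using the reserved field IDs quoted in the last bullet. `PositionDelete`, `sort_deletes`, and `single_referenced_file` are hypothetical names for illustration, and the real writer works on Arrow batches rather than a slice of structs.

```rust
/// Reserved field IDs for position delete files, as listed above.
const FILE_PATH_FIELD_ID: i32 = 2_147_483_546;
const POS_FIELD_ID: i32 = 2_147_483_545;

/// One position delete row. Field order matters: the derived `Ord`
/// compares `file_path` first and `pos` second.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct PositionDelete {
    file_path: String,
    pos: i64,
}

/// Sorts rows by (file_path, pos).
fn sort_deletes(rows: &mut [PositionDelete]) {
    rows.sort_unstable();
}

/// Returns the data file path when every row targets the same file,
/// which is the case where `referenced_data_file` can be recorded.
fn single_referenced_file(rows: &[PositionDelete]) -> Option<&str> {
    let first = rows.first()?;
    rows.iter()
        .all(|row| row.file_path == first.file_path)
        .then(|| first.file_path.as_str())
}
```

Relying on the derived ordering keeps the sort key in one place. A writer that buffers rows from several data files would sort once before flushing and set `referenced_data_file` only when `single_referenced_file` returns a path.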
## Breaking Changes
**Integration Test API Changes:**
- `get_shared_containers()` now requires `ContainerRuntime` parameter
- `random_ns()` now requires `ContainerRuntime` parameter
- `set_test_fixture()` now requires `ContainerRuntime` parameter
These changes only affect integration tests, not the public API.
## Testing
### Run All Tests
```bash
cargo test --package iceberg
cargo test --package iceberg-integration-tests
```
### Run Specific Integration Tests
```bash
# Delete files tests
cargo test --package iceberg-integration-tests --test shared delete_files -- --nocapture
# RowDelta tests
cargo test --package iceberg-integration-tests --test shared row_delta -- --nocapture
```
### Test Coverage
- **Unit tests**: 18 new tests for RowDeltaAction
- **Integration tests**: 5 end-to-end tests
- **Existing tests**: All passing (1000+ tests)
## File Changes Summary
### Added (3 files)
- `crates/iceberg/src/transaction/row_delta.rs` - RowDeltaAction
implementation
- `crates/integration_tests/tests/shared_tests/delete_files_test.rs` -
Delete file integration tests
- `crates/integration_tests/tests/shared_tests/row_delta_test.rs` - RowDelta
integration test
### Deleted (6 files)
- Internal documentation and example files
- Merged `deletion_vector_writer.rs` into `deletion_vector.rs`
- Replaced standalone deletion vector tests with integration tests
### Modified (22 files)
- Core deletion support infrastructure
- Transaction actions and snapshot tracking
- Integration test improvements for Podman/Docker compatibility
## Spec Compliance
This implementation follows Apache Iceberg specifications:
- Complete API matching Java reference implementation
- Conflict detection with serializable isolation guarantees
- Multiple validation modes for different use cases
- Full test coverage (unit and integration tests)
- Follows Iceberg table format v2/v3
- Efficient indexing and lookup structures
## Future Work
Not included in this PR:
- DataFusion equality delete join optimization (#1530)
- Delete file compaction
- Performance benchmarks
- Cross-compatibility testing with Java/Python implementations
## Documentation
All public APIs include rustdoc documentation with examples and usage notes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]