gabeiglio opened a new issue, #3039: URL: https://github.com/apache/iceberg-python/issues/3039
### Feature Request / Improvement

We’ve observed a large performance gap between the Python and Java implementations for logical overwrites (metadata-only). Profiling shows most time is spent in `snapshot.py` (`_manifests`), where we are not pruning manifests when computing `_existing_manifests` and `_deleted_entries`.

After adding manifest pruning, we see the following benchmark results (100 overwrite iterations):

| Scenario                              | Avg (s) | Min (s) | Max (s) |
|---------------------------------------|---------|---------|---------|
| Current PyIceberg – same partition    | 1.15    | 0.78    | 1.51    |
| Current PyIceberg – random partitions | 0.96    | 0.77    | 1.26    |
| Pruning PyIceberg – same partition    | 0.50    | 0.28    | 0.78    |
| Pruning PyIceberg – random partitions | 0.38    | 0.27    | 0.49    |

Benchmark script: https://gist.github.com/gabeiglio/0092970c144228ef6d333a873dc1d316

Here is the [PR](https://github.com/apache/iceberg-python/pull/3011) for the optimization.
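
For readers unfamiliar with the idea, manifest pruning can be illustrated roughly as follows. This is a minimal sketch and not the code from the PR or PyIceberg's API: `ManifestSummary`, `prune_manifests`, the integer partition values, and the file names are all hypothetical. It only assumes that each manifest carries lower and upper bounds for its partition values, so manifests whose range cannot contain the overwritten partition can be carried over without being opened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical stand-in for a manifest's partition summary."""
    path: str
    lower: int  # smallest partition value covered by the manifest
    upper: int  # largest partition value covered by the manifest


def prune_manifests(manifests, partition_value):
    """Split manifests into those that may hold the partition and those that cannot."""
    candidates, untouched = [], []
    for m in manifests:
        if m.lower <= partition_value <= m.upper:
            # Range overlaps: entries must be read to find deleted files.
            candidates.append(m)
        else:
            # Range cannot match: the manifest can be kept as-is, unread.
            untouched.append(m)
    return candidates, untouched


manifests = [
    ManifestSummary("m0.avro", 0, 9),
    ManifestSummary("m1.avro", 10, 19),
    ManifestSummary("m2.avro", 20, 29),
]
candidates, untouched = prune_manifests(manifests, partition_value=12)
print([m.path for m in candidates])
print([m.path for m in untouched])
```

In this toy setup only one of the three manifests has to be read for an overwrite of partition `12`; the other two are reused untouched, which is the kind of saving the benchmark above is measuring.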
