gabeiglio opened a new issue, #3039:
URL: https://github.com/apache/iceberg-python/issues/3039

   ### Feature Request / Improvement
   
   We’ve observed a large performance gap between the Python and Java 
implementations for logical (metadata-only) overwrites. Profiling shows that 
most of the time is spent in `_manifests` in `snapshot.py`, which does not 
prune manifests when computing `_existing_manifests` and `_deleted_entries`.
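
To illustrate the idea, here is a minimal sketch of manifest pruning. It is not the code from the PR and does not use PyIceberg classes: `ManifestInfo`, `may_contain`, and `split_manifests` are hypothetical names, and a single integer partition column is assumed. The premise is that each manifest carries lower and upper bounds for its partition values, so a manifest whose range cannot overlap the overwritten partitions can be carried over without being opened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestInfo:
    """Hypothetical stand-in for a manifest-list entry (not the PyIceberg class)."""

    path: str
    lower: int  # smallest partition value among the manifest's data files
    upper: int  # largest partition value among the manifest's data files


def may_contain(manifest: ManifestInfo, lo: int, hi: int) -> bool:
    """Return True if the manifest's partition range overlaps [lo, hi]."""
    return manifest.lower <= hi and manifest.upper >= lo


def split_manifests(manifests, lo, hi):
    """Split manifests into those that must be read and those kept unread."""
    to_scan = [m for m in manifests if may_contain(m, lo, hi)]
    untouched = [m for m in manifests if not may_contain(m, lo, hi)]
    return to_scan, untouched


manifests = [
    ManifestInfo("m0.avro", 0, 9),
    ManifestInfo("m1.avro", 10, 19),
    ManifestInfo("m2.avro", 20, 29),
]

# Overwrite only partition value 12: a single manifest needs to be opened.
to_scan, untouched = split_manifests(manifests, 12, 12)
print([m.path for m in to_scan])
print([m.path for m in untouched])
```

In this toy case only `m1.avro` is read, and the other two manifests are kept as they are. The saving grows with the number of manifests that fall outside the overwritten partitions.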
   
   After adding manifest pruning, we see the following benchmark results (100 
overwrite iterations):
   
   | Scenario                              | Avg (s) | Min (s) | Max (s) |
   |---------------------------------------|---------|---------|---------|
   | Current PyIceberg – same partition    | 1.15    | 0.78    | 1.51    |
   | Current PyIceberg – random partitions | 0.96    | 0.77    | 1.26    |
   | Pruning PyIceberg – same partition    | 0.50    | 0.28    | 0.78    |
   | Pruning PyIceberg – random partitions | 0.38    | 0.27    | 0.49    |
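
As a quick reading of the numbers above, the average speedup per scenario can be computed from the "Avg (s)" column. This is only arithmetic on the reported averages. The dictionary layout and variable names are illustrative and are not part of the benchmark script.

```python
# Average seconds per overwrite, taken from the table: (current, with pruning).
avg_seconds = {
    "same partition": (1.15, 0.50),
    "random partitions": (0.96, 0.38),
}

# Speedup = time without pruning / time with pruning.
speedups = {
    scenario: current / pruned
    for scenario, (current, pruned) in avg_seconds.items()
}

for scenario, factor in speedups.items():
    print(f"{scenario}: {factor:.2f}x faster on average")
```

On these averages, pruning is roughly 2.3x faster when overwriting the same partition and roughly 2.5x faster for random partitions.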
   
   Benchmark script: 
https://gist.github.com/gabeiglio/0092970c144228ef6d333a873dc1d316
   
   Here is the [PR](https://github.com/apache/iceberg-python/pull/3011) for the 
optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

