mr-brobot commented on PR #8104:
URL: https://github.com/apache/iceberg/pull/8104#issuecomment-1655738484

   ### Profiling via `yappi`
   
   On `master`:
   ```
   Clock type: CPU
   Ordered by: totaltime, desc
   
   name                                  ncall  tsub      ttot      tavg      
   <6 unrelated entries>
   ..g/io/pyarrow.py:741 _task_to_table  230    0.004176  0.284528  0.001237
   ```
   
   On this branch:
   ```
   Clock type: CPU
   Ordered by: totaltime, desc
   
   name                                  ncall  tsub      ttot      tavg      
   <27 unrelated entries>
   ..g/io/pyarrow.py:743 _task_to_table  12/1   0.000200  0.029458  0.002455
   ```
   
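   Looks like calls to `_task_to_table` are being avoided (230 on master vs. 12 on this branch), and less total time is spent in that function (about 0.285 s vs. about 0.029 s). The average time per call is higher (0.002455 s vs. 0.001237 s), but that's because most of the calls on master short-circuit immediately, which pulls the master average down.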
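
As a quick sanity check on the yappi figures above, here is a minimal sketch that recomputes `tavg` as `ttot / ncall` and compares the two runs. The dictionary layout and the `tavg` helper are illustrative names, not part of yappi or PyIceberg; the numbers are copied from the listings, and the branch's `12/1` entry is treated as 12 calls.

```python
# Figures copied from the two yappi listings (CPU clock, seconds).
master = {"ncall": 230, "ttot": 0.284528}
branch = {"ncall": 12, "ttot": 0.029458}


def tavg(stats):
    """Average time per call: total time divided by number of calls."""
    return stats["ttot"] / stats["ncall"]


# How many times fewer calls, and how many times less total time.
call_ratio = master["ncall"] / branch["ncall"]
ttot_ratio = master["ttot"] / branch["ttot"]

print(f"master tavg: {tavg(master):.6f} s")
print(f"branch tavg: {tavg(branch):.6f} s")
print(f"calls reduced {call_ratio:.1f}x, total time reduced {ttot_ratio:.1f}x")
```

The call count falls by a larger factor than the total time, which is why the per-call average goes up even though the function costs less overall.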
   
   ### Benchmarking via `hyperfine`
   
   On `master`:
   ```
   $ hyperfine --warmup 1 "python -m scripts.pyiceberg"
   Benchmark 1: python -m scripts.pyiceberg
     Time (mean ± σ):      7.773 s ±  0.462 s    [User: 17.551 s, System: 7.106 s]
     Range (min … max):    7.160 s …  8.440 s    10 runs
   ```
   
   On this branch:
   ```
   $ hyperfine --warmup 1 "python -m scripts.pyiceberg"
   Benchmark 1: python -m scripts.pyiceberg
     Time (mean ± σ):      8.369 s ±  0.444 s    [User: 20.543 s, System: 8.006 s]
     Range (min … max):    7.915 s …  9.346 s    10 runs
   ```
   
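   This branch seems to be about 0.6 s (roughly 8%) slower on average (8.369 s vs. 7.773 s), and I don't yet know why, although the min/max ranges of the two runs overlap. I was hoping for a significant performance improvement! I should figure this out. Sorry, @Fokko. 😢 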
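
To put the hyperfine summaries above in perspective, here is a rough sketch that computes the relative slowdown, a Welch-style t statistic for the difference in means, and the change in total CPU time. The t statistic is an addition for illustration, not something computed in the original benchmark, and it assumes the 10 runs per side are independent and roughly normally distributed. The dictionary keys are illustrative names.

```python
import math

# Summaries copied from the hyperfine output (seconds, 10 runs each).
master = {"mean": 7.773, "sd": 0.462, "user": 17.551, "system": 7.106, "runs": 10}
branch = {"mean": 8.369, "sd": 0.444, "user": 20.543, "system": 8.006, "runs": 10}

# Absolute and relative difference in mean wall-clock time.
diff = branch["mean"] - master["mean"]
rel_slowdown = diff / master["mean"]

# Standard error of the difference between two independent means (Welch).
std_err = math.sqrt(
    master["sd"] ** 2 / master["runs"] + branch["sd"] ** 2 / branch["runs"]
)
t_stat = diff / std_err

# Total CPU time (user + system), summed over all threads and processes.
cpu_master = master["user"] + master["system"]
cpu_branch = branch["user"] + branch["system"]
cpu_increase = cpu_branch / cpu_master - 1

print(f"slowdown: {diff:.3f} s ({rel_slowdown:.1%})")
print(f"t statistic: {t_stat:.2f}")
print(f"CPU time increase: {cpu_increase:.1%}")
```

Under those assumptions the statistic comes out a little under 3, which suggests the gap is probably not just run-to-run noise. Total CPU time also rises, so the branch may be doing more work overall rather than only waiting longer, though more runs would be needed to confirm that.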
   
   ### Test Subject
   
   These results come from running the following script against a table in S3 with more than 1B records and 230 data files:
   ```python
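   from pyiceberg.catalog.glue import GlueCatalog
   
   # Glue-backed catalog registered under the name "cloudbend".
   catalog = GlueCatalog("cloudbend")
   
   table = catalog.load_table("benchmark.nyc_taxi")
   
   # Only 10 rows are requested, so the scan should be able to stop early.
   result = table.scan(limit=10).to_arrow()
   
   assert len(result) == 10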
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

