mr-brobot commented on PR #8104:
URL: https://github.com/apache/iceberg/pull/8104#issuecomment-1655738484
### Profiling via `yappi`
On `master`:
```
Clock type: CPU
Ordered by: totaltime, desc
name ncall tsub ttot tavg
<6 unrelated entries>
..g/io/pyarrow.py:741 _task_to_table 230 0.004176 0.284528 0.001237
```
On this branch:
```
Clock type: CPU
Ordered by: totaltime, desc
name ncall tsub ttot tavg
<27 unrelated entries>
..g/io/pyarrow.py:743 _task_to_table 12/1 0.000200 0.029458 0.002455
```
Calls to `_task_to_table` drop from 230 on `master` to 12 on this branch, and
total time spent in that function falls from about 0.285 s to 0.029 s. Average
call time is higher on this branch, but that's because most of the calls on
master short-circuit immediately, which pulls the master average down.
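As a quick sanity check on those numbers, the sketch below recomputes `tavg` as `ttot / ncall` from the two `_task_to_table` rows. It assumes the `12/1` entry should be read as 12 total calls, which is what reproduces the reported average; the variable names are only illustrative.

```python
# Figures copied from the two yappi rows for _task_to_table.
# Assumption: "12/1" is treated as 12 total calls.
master = {"ncall": 230, "ttot": 0.284528}
branch = {"ncall": 12, "ttot": 0.029458}


def tavg(stats):
    """Average time per call in seconds: total time divided by call count."""
    return stats["ttot"] / stats["ncall"]


master_tavg = tavg(master)
branch_tavg = tavg(branch)

# How many times less total time the branch spends in the function.
ttot_ratio = master["ttot"] / branch["ttot"]

print(f"master tavg: {master_tavg:.6f} s")
print(f"branch tavg: {branch_tavg:.6f} s")
print(f"total time ratio (master / branch): {ttot_ratio:.1f}x")
```

The per-call average roughly doubles while the total drops by close to an order of magnitude, which is consistent with many cheap calls being removed rather than the remaining calls getting slower.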
### Benchmarking via `hyperfine`
On `master`:
```
$ hyperfine --warmup 1 "python -m scripts.pyiceberg"
Benchmark 1: python -m scripts.pyiceberg
Time (mean ± σ): 7.773 s ± 0.462 s [User: 17.551 s, System: 7.106 s]
Range (min … max): 7.160 s … 8.440 s 10 runs
```
On this branch:
```
$ hyperfine --warmup 1 "python -m scripts.pyiceberg"
Benchmark 1: python -m scripts.pyiceberg
Time (mean ± σ): 8.369 s ± 0.444 s [User: 20.543 s, System: 8.006 s]
Range (min … max): 7.915 s … 9.346 s 10 runs
```
This branch is about 0.6 s (roughly 8%) slower in mean wall-clock time, and I
don't know why yet. I was hoping for a significant performance improvement! I
should figure this out. Sorry, @Fokko. 😢
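A rough way to judge whether that gap is more than run-to-run noise is to compare it against the reported standard deviations. The sketch below uses only the summary figures from the two `hyperfine` outputs and computes a Welch-style t statistic; it assumes the 10 runs in each benchmark are independent and roughly normally distributed, which was not verified here.

```python
import math

# Summary figures copied from the hyperfine output (seconds, 10 runs each).
runs = 10
master_mean, master_sd = 7.773, 0.462
branch_mean, branch_sd = 8.369, 0.444

# Absolute and relative slowdown of the branch.
diff = branch_mean - master_mean
relative = diff / master_mean

# Standard error of the difference between two independent means.
std_err = math.sqrt(master_sd**2 / runs + branch_sd**2 / runs)

# Welch-style t statistic: how many standard errors apart the means are.
t_stat = diff / std_err

print(f"slowdown: {diff:.3f} s ({relative:.1%})")
print(f"standard error of the difference: {std_err:.3f} s")
print(f"t statistic: {t_stat:.2f}")
```

With only 10 runs per side this is indicative at best. Since the script reads from S3, network variance could also contribute, so more runs would give a firmer answer.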
### Test Subject
These tests are from running the following script on a >1B record table in
S3 with 230 data files:
```python
from pyiceberg.catalog.glue import GlueCatalog

catalog = GlueCatalog("cloudbend")
table = catalog.load_table("benchmark.nyc_taxi")

# Request only 10 rows from the table and materialize them as an Arrow table.
result = table.scan(limit=10).to_arrow()
assert len(result) == 10
```