westonpace edited a comment on pull request #12150:
URL: https://github.com/apache/arrow/pull/12150#issuecomment-1012676572
Looking at these benchmarks, I was a little surprised by how much impact reading
a small number of columns has on overall performance. That said, for all
benchmarks except UncachedFile, more columns also means a larger metadata / data
ratio.
The results below are the motivation for ARROW-14577:
```
ReadUncachedFile/num_cols:1/is_partial:1/real_time 62928170 ns 18097667 ns 9 bytes_per_second=254.258M/s
ReadUncachedFile/num_cols:8/is_partial:1/real_time 133338397 ns 33394170 ns 4 bytes_per_second=119.995M/s
ReadUncachedFile/num_cols:64/is_partial:1/real_time 1467858164 ns 284143075 ns 1 bytes_per_second=87.2019M/s
ReadUncachedFileAsync/num_cols:1/is_partial:1/real_time 82149012 ns 4770464 ns 8 bytes_per_second=194.768M/s
ReadUncachedFileAsync/num_cols:8/is_partial:1/real_time 999669229 ns 25252003 ns 1 bytes_per_second=16.0053M/s
ReadUncachedFileAsync/num_cols:64/is_partial:1/real_time 8953151862 ns 122078755 ns 1 bytes_per_second=14.2966M/s
```
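To put a number on the gap, here is a minimal sketch that divides the async real time by the sync real time for each column count. The values are copied from the output above; the names `sync_ns`, `async_ns` and `slowdown` are only illustrative and are not part of the benchmark code.

```python
# real_time values in nanoseconds, copied from the benchmark output above.
# Keys are num_cols; all runs are is_partial:1.
sync_ns = {1: 62928170, 8: 133338397, 64: 1467858164}
async_ns = {1: 82149012, 8: 999669229, 64: 8953151862}


def slowdown(baseline_ns, candidate_ns):
    """Return how many times longer the candidate took than the baseline."""
    return candidate_ns / baseline_ns


# Ratio of async real time to sync real time, per column count.
async_slowdown = {
    cols: slowdown(sync_ns[cols], async_ns[cols]) for cols in sync_ns
}

for cols, ratio in async_slowdown.items():
    print(f"num_cols={cols}: async took {ratio:.1f}x the sync real time")
```

On these numbers the async reader is roughly 1.3x slower with 1 column, but about 7.5x and 6.1x slower with 8 and 64 columns.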
This is concerning since the two benchmarks should be doing the exact same
task (a partial read with 1 column should be the same as a full read with 1
column):
```
ReadMmapCachedFile/num_cols:1/is_partial:0/real_time 125726 ns 125677 ns 4359 bytes_per_second=124.278G/s
ReadMmapCachedFile/num_cols:1/is_partial:1/real_time 1404416 ns 1403848 ns 429 bytes_per_second=11.1256G/s
```
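The size of that discrepancy can be checked with the same kind of arithmetic. This is just a sketch over the two real_time values above; the variable names are illustrative.

```python
# real_time values in nanoseconds for ReadMmapCachedFile with num_cols:1,
# copied from the benchmark output above.
full_ns = 125726      # is_partial:0
partial_ns = 1404416  # is_partial:1

# How many times longer the partial read took than the full read.
partial_penalty = partial_ns / full_ns

print(f"partial read took {partial_penalty:.1f}x the full read real time")
```

That is a bit over 11x for what should be the same amount of work, which matches the ratio of the reported throughputs (124.278G/s vs 11.1256G/s).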