westonpace commented on pull request #11616: URL: https://github.com/apache/arrow/pull/11616#issuecomment-1012723649
I split the benchmarks into a separate PR (#12150). I did a bit more analysis today. There is a substantial performance loss in a few situations:

* Async full file reads (e.g. when reading all columns): David's suggestion will probably work here but it's going to be a bit tricky to implement. Right now, each time we load a shared record batch we use a dedicated read cache, so there is no single read cache on which to mark "cache" for the whole file. I plan to look at this more tomorrow.
* Reading from a buffer (e.g. when doing no I/O at all). For example:

Old:
```
ReadBufferAsync/num_cols:1/is_partial:0/iterations:50000/real_time_mean 3623 ns 3623 ns 10 bytes_per_second=269.542G/s
```

New:
```
ReadBufferAsync/num_cols:1/is_partial:0/iterations:50000/real_time_mean 8651 ns 8651 ns 10 bytes_per_second=112.89G/s
```

In this particular case we are over 2x slower. Some of this slowdown is because the read range cache is calling `file->WillNeed` on these regions, which triggers an `madvise` (which seems to eat up time mainly by virtue of being a system call). Removing that call gets us to `159G/s`, although I'm not really sure that's the right path to take. I'm pretty sure the rest of the time is lost because we are using more futures, which means more allocations and more `shared_ptr` usage. There is no quick fix for that, but I am thinking I want to tackle Future improvements in 8.0.0.

There's also a Windows build error that I will fix.

At the moment I am leaning towards including this, but I am somewhat split. The slowdowns are on an already lightning-fast path (e.g. we are going from roughly 4000ns to 8000ns for a zero-copy buffer read) for an operation we aren't yet calling in any real critical section (these calls are per-batch). The speedup is on a very slow path (e.g. going from 7.8 seconds to 1.7 seconds on a 1G file read because we're reading 8 columns instead of 64 columns), but that path may not be as common for IPC.

-- This is an automated message from the Apache Git Service.