westonpace commented on pull request #11616:
URL: https://github.com/apache/arrow/pull/11616#issuecomment-995349558


   I dug into the performance a bit more for the small files case (I'll do S3 
soon but I think I want to do real S3 and not minio since the former supports 
parallelism and the latter, attached to my HDD, does not).
   
   Note: The asynchronous readers in these tests are not consumed in parallel; we wait until a batch is returned before requesting the next one.  However, the asynchronous readers still issue parallel reads and use threads: reading a single batch that needs 8 columns will trigger 8 parallel reads.
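
   For reference, here is a rough sketch of that consumption pattern (assuming Arrow C++'s `RecordBatchFileReader::GetRecordBatchGenerator`; error handling trimmed).  Each batch future is awaited before the next one is requested, so consumption itself is sequential even though a single request can fan out into parallel per-column reads internally:

   ```cpp
   // Sketch only: sequential consumption of the async reader, as in these benchmarks.
   #include <string>

   #include <arrow/io/file.h>
   #include <arrow/ipc/reader.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   arrow::Status ConsumeSequentially(const std::string& path) {
     ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
     ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(file));
     ARROW_ASSIGN_OR_RAISE(auto generator, reader->GetRecordBatchGenerator());
     while (true) {
       // Block on each future before asking for the next batch (sequential consumption);
       // reading this one batch may still issue one parallel read per needed column.
       ARROW_ASSIGN_OR_RAISE(auto batch, generator().result());
       if (!batch) break;  // end of stream
       // ... consume batch ...
     }
     return arrow::Status::OK();
   }
   ```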
   
   Note: Even the synchronous reader will use parallel reads if only a subset of the columns is targeted.  It uses the IoRecordedRandomAccessFile, which in turn uses the read range cache, and that cache performs its reads in parallel.
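
   As a concrete example (a sketch, assuming the standard Arrow C++ `IpcReadOptions::included_fields` option), selecting a column subset with the synchronous reader looks like this; per the note above, this path ends up going through the read range cache, which issues its reads in parallel:

   ```cpp
   // Sketch only: synchronous read of a column subset via IpcReadOptions::included_fields.
   #include <string>

   #include <arrow/io/file.h>
   #include <arrow/ipc/options.h>
   #include <arrow/ipc/reader.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   arrow::Status ReadColumnSubset(const std::string& path) {
     ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
     auto options = arrow::ipc::IpcReadOptions::Defaults();
     options.included_fields = {0, 1};  // only materialize the first two columns
     ARROW_ASSIGN_OR_RAISE(auto reader,
                           arrow::ipc::RecordBatchFileReader::Open(file, options));
     for (int i = 0; i < reader->num_record_batches(); ++i) {
       // Only the selected columns' buffers are read from the file.
       ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
       // ... consume batch ...
     }
     return arrow::Status::OK();
   }
   ```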
   
   ### Hot In-Memory Memory-Mapped File (also `arrow::io::BufferReader`)
   
   Asynchronous reads should never be used in this case.  A "read" is just 
pointer arithmetic.  There are no copies.  I didn't benchmark this case.
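
   For context, this case is just `arrow::io::BufferReader` (or a hot memory map) wrapping bytes that are already in memory; a short sketch (assuming the usual Arrow C++ API) of why a "read" here is pure pointer arithmetic:

   ```cpp
   // Sketch only: an in-memory source where a "read" is pointer arithmetic, not I/O.
   #include <memory>

   #include <arrow/buffer.h>
   #include <arrow/io/memory.h>
   #include <arrow/ipc/reader.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   arrow::Status ReadFromMemory(std::shared_ptr<arrow::Buffer> ipc_file_bytes) {
     auto source = std::make_shared<arrow::io::BufferReader>(std::move(ipc_file_bytes));
     ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(source));
     // The returned batch's buffers alias the in-memory IPC bytes; nothing is copied.
     ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
     // ... consume batch ...
     return arrow::Status::OK();
   }
   ```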
   
   ### Cold On-Disk Memory-Mapped File
   
   I did not test this.  I'm not sure if it is an interesting case or not.
   
   ### Hot In-Memory Regular File
   
   Cherry-picking some interesting cases (note: the rate here is based on the total buffer size of the selected columns, so selecting fewer columns does not automatically yield a higher rate).
   
   | Sync/Async | # of columns | # of columns selected | Rate (bytes/s) | Note |
   | - | - | - | - | - |
   | Sync | 16 | 16 | 9.79967G/s | Seems to be my machine's single-thread DRAM bandwidth limit |
   | Sync | 16 | 2 | 12.8979G/s | Parallel reads increase DRAM bandwidth |
   | Sync | 256 | 256 | 8.73684G/s | Starting to hit CPU bottleneck from excess 
metadata |
   | Sync | 256 | 32 | 7.28792G/s | Since we are throttled on metadata / CPU, 
perf gets worse |
   | Async | 16 | 16 | 2.58248G/s | Async is quite a bit worse than baseline 
for full reads |
   | Async | 16 | 2 | 13.9343G/s | Async perf is similar to sync on partial reads |
   | Async | 256 | 256 | 2.4068G/s | |
   | Async | 256 | 32 | 6.8774G/s | |
   | Old-Async | 16 | 16 | 2.84301G/s | Old implementation has slightly lower overhead, I think |
   | Old-Async | 16 | 2 | 556.501M/s | Old implementation does not handle 
partial reads well |
   | Old-Async | 256 | 256 | 2.78802G/s | |
   | Old-Async | 256 | 32 | 459.484M/s | |
   
   Conclusions: This change significantly improves the performance of partial async reads, to the point where a partial async read on a "well-formed" file (data >> metadata) is comparable to a partial sync read.
   
   An async full read is still considerably worse than a sync full read, which is surprising but possibly due to threading overhead.  This is worth investigating in a future PR.
   
   ### Cold On-Disk Regular File
   
   | Sync/Async | # of columns | # of columns selected | Rate (bytes/s) | Note |
   | - | - | - | - | - |
   | Sync | 16 | 16 | 111.044M/s | Baseline, HDD throughput |
   | Sync | 16 | 2 | 25.205M/s | Surprising, more below |
   | Sync | 256 | 256 | 99.8336M/s | |
   | Sync | 256 | 32 | 15.2597M/s | Surprising |
   | Async | 16 | 16 | 98.5425M/s | Similar to sync, within noise but did 
consistently seem a bit lower |
   | Async | 16 | 2 | 54.1136M/s | |
   | Async | 256 | 256 | 96.5957M/s | |
   | Async | 256 | 32 | 11.911M/s | Actually within noise of the sync result; both seem to bottom out around a noisy 10-16M/s |
   | Old-Async | 16 | 16 | 138.266M/s | Not just noise, old async real-file is 
consistently better than sync |
   | Old-Async | 16 | 2 | 17.4384M/s | |
   | Old-Async | 256 | 256 | 123.721M/s | |
   | Old-Async | 256 | 32 | 16.4605M/s | |
   
   Conclusions: This change does improve the performance of partial async reads.  However, it seems to come at the cost of full async reads.  David's suggestion of falling back to a full-file read should alleviate this; a sketch of the idea follows.
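
   To illustrate that fallback (a hypothetical sketch; the names and the 50% threshold here are mine, not the PR's): before issuing coalesced range reads we could check how much of the file the selected ranges cover and, past some threshold, just read the whole file sequentially:

   ```cpp
   // Hypothetical sketch of a "fall back to a full-file read" heuristic.
   #include <cstdint>
   #include <vector>

   struct ByteRange {
     int64_t offset;
     int64_t length;
   };

   // Returns true when the selected ranges cover enough of the file that one large
   // sequential read is likely cheaper than many (coalesced) random reads on a disk.
   bool ShouldReadEntireFile(const std::vector<ByteRange>& ranges, int64_t file_size,
                             double coverage_threshold = 0.5) {
     int64_t selected_bytes = 0;
     for (const auto& range : ranges) selected_bytes += range.length;
     return file_size > 0 &&
            static_cast<double>(selected_bytes) >=
                coverage_threshold * static_cast<double>(file_size);
   }
   ```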
   
   In all cases the performance of partial reads deteriorates quickly.  This is because we essentially fall back to either "reading too much" (Old-Async) or random reads.  The random-read rates line up with using `fio` to benchmark my disk: at 16 batches the data blocks are 520KB, and `fio` random reads at 520KB give ~45MBps; at 256 batches the data blocks are 32KB, and `fio` gives ~4MBps (so either `fio` is too pessimistic or we are able to take advantage of the pseudo-sequential nature of the reads).
   
   ### Remaining tasks
   
   - [ ] Add fallback to full-file read for async
   - [ ] Investigate S3
   - [ ] Investigate multi-threaded local reads (both multiple files and 
consuming in parallel)
   - [ ] Recommend that users structure record batches so that each column contains at least 4MB of data if they plan to read from disk (rough arithmetic sketch below).
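
   As a rough illustration of that last guideline (my own arithmetic, fixed-width columns only), the 4MB target translates into a minimum row count per batch:

   ```cpp
   // Rough arithmetic for the "at least 4MB per column per batch" guideline.
   #include <cstdint>

   constexpr int64_t kTargetBytesPerColumn = 4LL * 1024 * 1024;  // 4 MiB

   // For a fixed-width column, the minimum number of rows per batch.
   constexpr int64_t MinRowsPerBatch(int64_t bytes_per_value) {
     return kTargetBytesPerColumn / bytes_per_value;
   }

   static_assert(MinRowsPerBatch(8) == 524288, "int64 columns: ~512Ki rows per batch");
   static_assert(MinRowsPerBatch(4) == 1048576, "int32/float columns: ~1Mi rows per batch");
   ```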
   

