etseidl commented on PR #10007: URL: https://github.com/apache/arrow-rs/pull/10007#issuecomment-4526351737
Decoding BSS amounts to transposing a matrix. Each byte of source is used exactly once. The current algorithm in effect treats the source as `N = TYPE_SIZE` vectors and scans each sequentially. As long as `TYPE_SIZE` is not too large, there should not be any cache evictions happening in the tight inner loop. All this fix does is add another outer loop, making the indexing more complex in the process.

When `TYPE_SIZE` is large enough, you'll start getting L1 evictions in the inner loop, so after incrementing `i` each read will have to fetch an entire cache line again. But the fix here does the opposite: it blocks on the outer loop rather than on the inner loop bounded by `TYPE_SIZE`.

Another issue is that the modified method is really only used for primitives; FLBA is handled elsewhere (at least for the arrow reader interface). The target for optimization should be `read_byte_stream_split()` in `parquet/src/arrow/array_reader/fixed_len_byte_array.rs`.

Before embarking on this, one would need to do a detailed analysis of what value of `TYPE_LEN` is large enough to start generating cache misses. Only after demonstrating an actual slowdown can one make the case for further optimization of this part of the code. I have a feeling that there are other, more important bottlenecks in the decoding of FLBA with BSS.

I'm going to mark this as 'draft' for now.
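
To make the access pattern discussed above concrete, here is a minimal sketch of BSS decoding as a transpose. It is not the arrow-rs implementation: `decode_bss_sketch` is a hypothetical name, and the sketch assumes the `TYPE_SIZE` byte streams are stored back to back in `src`, each `num_values` bytes long.

```rust
/// Hypothetical illustration only, not the arrow-rs code.
///
/// `src` is assumed to hold `TYPE_SIZE` byte streams laid out back to back,
/// each `num_values` bytes long. `dst` receives `num_values` values of
/// `TYPE_SIZE` bytes each.
fn decode_bss_sketch<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    assert_eq!(src.len() % TYPE_SIZE, 0);
    let num_values = src.len() / TYPE_SIZE;

    // Outer loop: one iteration per decoded value.
    for i in 0..num_values {
        // Inner loop, bounded by TYPE_SIZE: read byte `i` of every stream.
        // This touches TYPE_SIZE distinct source locations, spaced
        // `num_values` bytes apart, so cache pressure grows with TYPE_SIZE.
        for j in 0..TYPE_SIZE {
            dst[i * TYPE_SIZE + j] = src[j * num_values + i];
        }
    }
}
```

Each source byte is read exactly once and each destination byte is written exactly once, so any blocking scheme would only change the order of those accesses, not their count.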
