etseidl commented on PR #10007:
URL: https://github.com/apache/arrow-rs/pull/10007#issuecomment-4526351737

   Decoding BSS amounts to transposing a matrix. Each byte of source is used 
exactly once. The current algorithm in effect treats the source as `N = 
TYPE_SIZE` vectors and scans each sequentially. As long as `TYPE_SIZE` is not 
too large, there should not be any cache evictions happening in the tight inner 
loop. All this fix does is add another outer loop, making the indexing more 
complex in the process. Once `TYPE_SIZE` is large enough, you'll start 
getting L1 evictions in the inner loop, so after incrementing `i` each read 
will have to fetch an entire cache line again. But the fix here does the 
opposite: it blocks on the outer loop rather than on the inner loop bounded by 
`TYPE_SIZE`.
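
   For reference, the transpose described above can be sketched roughly as follows. This is a minimal illustration rather than the code in the PR or in arrow-rs: the name `bss_decode_sketch` and its signature are hypothetical, and it assumes the source holds `TYPE_SIZE` streams of `num_values` bytes each, stored back to back.

```rust
/// Hypothetical helper, not the arrow-rs implementation.
/// Assumes `src` holds TYPE_SIZE byte streams of `num_values` bytes each,
/// laid out one after another.
fn bss_decode_sketch<const TYPE_SIZE: usize>(src: &[u8], num_values: usize) -> Vec<u8> {
    assert_eq!(src.len(), TYPE_SIZE * num_values);
    let mut dst = vec![0u8; src.len()];
    // Outer loop walks the values; the inner loop is bounded by TYPE_SIZE.
    for i in 0..num_values {
        for j in 0..TYPE_SIZE {
            // Byte j of value i sits at offset i within stream j.
            dst[i * TYPE_SIZE + j] = src[j * num_values + i];
        }
    }
    dst
}
```

Each iteration of the inner loop reads from a different stream, so the number of source cache lines kept live while `i` advances grows with `TYPE_SIZE`.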
   
   Another issue is that the modified method is really only used for 
primitives; FLBA is handled elsewhere (at least for the arrow reader 
interface). The target for optimization should be `read_byte_stream_split()` in 
`parquet/src/arrow/array_reader/fixed_len_byte_array.rs`.
   
   Before embarking on this, one would need to do a detailed analysis of what 
value of `TYPE_SIZE` is large enough to start generating cache misses. Only 
after demonstrating an actual slowdown can one make the case for further 
optimization of this part of the code. I have a feeling that there are other, 
more important bottlenecks in the decoding of FLBA with BSS.
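
   As a very rough starting point for that analysis, a back-of-the-envelope estimate of the inner loop's working set might look like the sketch below. The function name is hypothetical, and the model is an assumption: it supposes each stream is at least one cache line long (so every stream contributes its own source line) and ignores associativity, prefetching, and everything else in the cache. It is no substitute for measurement.

```rust
/// Hypothetical estimate, not a measurement. Assumes one live source cache
/// line per stream and ignores associativity and prefetching.
fn approx_working_set_bytes(type_size: usize, line_size: usize) -> usize {
    // One source line per stream stays live while `i` advances along it.
    let src_lines = type_size;
    // The destination bytes of one value are contiguous, so they span at most
    // ceil(type_size / line_size) + 1 lines when unaligned.
    let dst_lines = (type_size + line_size - 1) / line_size + 1;
    (src_lines + dst_lines) * line_size
}
```

Under this model, with 64-byte lines and an assumed 32 KiB L1 data cache, the estimate only exceeds the cache once `type_size` reaches the hundreds.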
   
   I'm going to mark this as 'draft' for now.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
