[I] Reading FIXED_LEN_BYTE_ARRAY columns with nulls is inefficient [arrow-rs]

via GitHub Fri, 23 Aug 2024 14:36:23 -0700


etseidl opened a new issue, #6296:
URL: https://github.com/apache/arrow-rs/issues/6296


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   When reading a Parquet file that has FIXED_LEN_BYTE_ARRAY columns containing 
nulls, one necessary operation is moving the fixed-length data into the 
correct location within the output buffer to account for null slots. This 
is handled by the 
[`pad_nulls`](https://github.com/apache/arrow-rs/blob/8c956a9f9ab26c14072740cce64c2b99cb039b13/parquet/src/arrow/array_reader/fixed_len_byte_array.rs#L237)
 function in the `ValuesBuffer` trait. The inner loop of this function
   ```rust
     for i in 0..byte_length {
         self.buffer[level_pos_bytes + i] = self.buffer[value_pos_bytes + i]
     }
   ```
   works well when the fixed width is small (`<= 4`), but for larger widths this 
loop is quite inefficient.
   
   **Describe the solution you'd like**
   Rewriting the inner loop for longer fixed-size arrays can speed this 
operation up considerably. In particular, if the loop copies between two 
disjoint slices of the buffer, the compiler can vectorize the move, e.g.
   ```rust
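     // Split so that source and destination are provably disjoint slices.
     let split = self.buffer.split_at_mut(level_pos_bytes);
     // Destination: the slot for this level, at the start of the upper half.
     let dst = &mut split.1[..byte_length];
     // Source: the packed value, which lies entirely in the lower half.
     let src = &split.0[value_pos_bytes..value_pos_bytes + byte_length];
     for i in 0..byte_length {
         dst[i] = src[i]
     }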
   ```
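   
   For illustration, here is a self-contained sketch of the same idea outside 
of arrow-rs. It is not the actual `pad_nulls` implementation: the name 
`pad_nulls_sketch` and the `&[bool]` validity mask are assumptions made to keep 
the example small (the real function works on a validity bitmask and a read 
offset). It only shows how the split-based copy fits into the backwards walk 
over the values.
   ```rust
     /// Hypothetical, simplified stand-in for `pad_nulls`.
     /// On entry `buffer` holds only the non-null values, packed at the front.
     /// `valid[i]` is true when output slot `i` is non-null.
     fn pad_nulls_sketch(buffer: &mut Vec<u8>, byte_length: usize, valid: &[bool]) {
         let values_read = valid.iter().filter(|v| **v).count();
         assert_eq!(buffer.len(), values_read * byte_length);
         // Grow the buffer so that every slot, null or not, has room.
         buffer.resize(valid.len() * byte_length, 0);

         // Indices of the non-null slots, walked from the back.
         let level_positions = valid
             .iter()
             .enumerate()
             .rev()
             .filter(|(_, v)| **v)
             .map(|(i, _)| i);

         for (value_pos, level_pos) in (0..values_read).rev().zip(level_positions) {
             // Once a value is already in its slot, all earlier ones are too.
             if level_pos <= value_pos {
                 break;
             }
             let level_pos_bytes = level_pos * byte_length;
             let value_pos_bytes = value_pos * byte_length;

             // level_pos > value_pos, so the source lies fully below the split.
             let (head, tail) = buffer.split_at_mut(level_pos_bytes);
             let dst = &mut tail[..byte_length];
             let src = &head[value_pos_bytes..value_pos_bytes + byte_length];
             for i in 0..byte_length {
                 dst[i] = src[i];
             }
         }
     }
   ```
   As in the original loop, null slots are not cleared here, so they may still 
hold stale bytes after the call. Only the slots marked valid are meaningful.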
   
   **Describe alternatives you've considered**
   I tried 
[`Vec::copy_within`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.copy_within)
 but it was slower than the vectorized copy.
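   
   For reference, a rough sketch of what the `copy_within` variant of the 
inner move looks like. The helper name `move_value_copy_within` is hypothetical 
and exists only for this example; the index names mirror the ones used in the 
loop above.
   ```rust
     /// Hypothetical helper: moves one fixed-length value from its packed
     /// position to its padded position using `copy_within`.
     fn move_value_copy_within(
         buffer: &mut [u8],
         value_pos_bytes: usize,
         level_pos_bytes: usize,
         byte_length: usize,
     ) {
         // `copy_within` takes a source range and a destination start index.
         buffer.copy_within(
             value_pos_bytes..value_pos_bytes + byte_length,
             level_pos_bytes,
         );
     }
   ```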
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
