tustvold commented on issue #34510:
URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464215953

   > @alamb @tustvold I saw your blog post about this for arrow-rs. Do you 
handle this differently in Rust?
   
   We don't support FixedSizeList in arrow-rs AFAIK. Parquet to my knowledge 
does not have an equivalent 
[logical](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 
construct, and so it isn't particularly clear to me what support would mean 
other than implicitly casting between a regular list and a fixed size list.
   
   > Calculating the def and rep levels for 100k rows with all non-null values 
takes 2x time as reading 8M doubles
   
   Assuming the doubles are PLAIN encoded this is not surprising: you are comparing the performance of what is effectively a `memcpy` that runs at memory bandwidth against a fairly complex [bit-packing](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) scheme used for the definition and repetition levels.
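
   To make the comparison concrete, here is a minimal sketch (not the arrow-rs code path) of what decoding a PLAIN-encoded double column amounts to; the values are contiguous little-endian bytes, so decoding is essentially one big copy:

   ```rust
   /// Minimal sketch: a PLAIN-encoded double column is just contiguous
   /// little-endian bytes, so decoding is effectively a memcpy-style copy.
   fn decode_plain_doubles(page: &[u8], num_values: usize) -> Vec<f64> {
       assert!(page.len() >= num_values * 8, "page too short");
       page[..num_values * 8]
           .chunks_exact(8)
           .map(|bytes| {
               // Reinterpret each 8-byte chunk as a little-endian f64
               let mut buf = [0u8; 8];
               buf.copy_from_slice(bytes);
               f64::from_le_bytes(buf)
           })
           .collect()
   }
   ```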
   
   In the Rust implementation we have a couple of tricks that help here, but it 
is still relatively expensive (at least compared to primitive decoding):
   
   * We decode definition levels directly to the null buffer if there are only 
nulls at the leaf level (i.e. no lists or nested nulls), allowing us to 
preserve the bit-packing
   * We have vectorised unpack implementations specialised for each bit width (I believe arrow C++ does this also); a rough sketch of the idea is shown below
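
   As an illustration of that second point, here is a minimal sketch of a width-specialised unpack, assuming a fixed width of 2 bits per level (enough for definition levels 0..=2); the real implementations generate one routine per width and lean on SIMD:

   ```rust
   /// Minimal sketch of a width-specialised unpack (hypothetical, not the
   /// arrow-rs routine): each byte holds four 2-bit levels, least significant
   /// bits first, which are widened into an i16 buffer.
   fn unpack_2bit(input: &[u8], num_levels: usize, output: &mut Vec<i16>) {
       let mut remaining = num_levels;
       for &byte in input {
           for shift in (0..8).step_by(2) {
               if remaining == 0 {
                   return;
               }
               output.push(((byte >> shift) & 0b11) as i16);
               remaining -= 1;
           }
       }
   }
   ```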
   
   > Definition level data is all 1 and supposed to be RLE encoded
   
   It will actually all be 2, unless the doubles are themselves not nullable: each repeated or nullable level on the path from the root to the leaf contributes one definition level.
   
   > Repetition level data is a vector of 0 followed by 79x1 repeated 100k 
times for our case. I'm not sure if RLE will help here, sounds like an 
unnecessary complex structure for fixed size lists
   
   These repetition levels will be [RLE encoded](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3). In theory a reader could preserve this encoding, but the record shredding logic is extremely fiddly, and doing so would risk adding complexity to an already very complex piece of code. At least in arrow-rs we decode to an array of `i16`.
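
   For what it's worth, a minimal sketch (hypothetical, not the arrow-rs decoder) of what materialising those repetition levels looks like for fixed-size lists of width 80: the per-row pattern of one 0 followed by 79 ones, which the RLE/bit-packing hybrid would typically store as short runs, gets expanded into a flat `i16` buffer:

   ```rust
   /// Minimal sketch (hypothetical): expand the repetition levels for
   /// `num_rows` fixed-size lists of `list_len` elements. Each row contributes
   /// a single 0 (start of a new record) followed by `list_len - 1` ones
   /// (continuation of the same list), e.g. 0 followed by 79x1 for width 80.
   fn materialize_rep_levels(num_rows: usize, list_len: usize) -> Vec<i16> {
       assert!(list_len > 0);
       let mut levels = Vec::with_capacity(num_rows * list_len);
       for _ in 0..num_rows {
           levels.push(0);
           levels.extend(std::iter::repeat(1).take(list_len - 1));
       }
       levels
   }
   ```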

