alippai commented on issue #34510: URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464158098
Looks like I have a lot to learn about repetition and definition levels, but they can also be RLE encoded, which means practically zero overhead when there are few nulls; in the best case the cost can be equal or close to the non-nullable path. I'm not a C++ coder, but summarizing the discussion above, there are 2-3 fast paths missing at different levels of the hierarchy (a rough benchmark sketch follows the list):

1. Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading the 8M doubles themselves (this is suspicious, but might be correct).
2. The definition level data is all `0` and should be RLE encoded, so we might want to skip expanding it entirely (perhaps decode the column as non-nullable, since checking every value is unnecessary?).
3. The repetition level data is a vector of `0` followed by 79x `1` per row in our case. I'm not sure RLE helps much here; it sounds like an unnecessarily complex structure for fixed-size lists. On the other hand, reading or decoding it could be skipped, since it can be derived from the metadata.
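
A minimal pyarrow sketch of the comparison in point 1, assuming 100k rows of 80-element fixed-size lists of doubles (8M values total); the file names and the exact shape are illustrative assumptions, not taken from the issue:

```python
# Rough sketch: compare reading 8M doubles as a flat column vs. the same
# values wrapped in an 80-element fixed-size list (100k rows), to observe the
# overhead of decoding repetition/definition levels. Shapes and file names
# are illustrative assumptions.
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

ROWS, LIST_SIZE = 100_000, 80
values = pa.array(np.random.rand(ROWS * LIST_SIZE))

# Flat column: 8M doubles, no nested rep/def levels to expand.
pq.write_table(pa.table({"flat": values}), "flat.parquet")

# Fixed-size list column: same 8M doubles, but the reader must also process
# repetition levels of the form [0, 1, 1, ..., 1] (one 0 plus 79 ones per row).
fsl = pa.FixedSizeListArray.from_arrays(values, LIST_SIZE)
pq.write_table(pa.table({"fixed": fsl}), "fixed.parquet")

for name in ("flat", "fixed"):
    start = time.perf_counter()
    pq.read_table(f"{name}.parquet")
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```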
