alippai commented on issue #34510: URL: https://github.com/apache/arrow/issues/34510#issuecomment-1464158098
Looks like I have a lot to learn about repetition and definition levels, but they can also be RLE encoded, which means practically zero overhead when there are few nulls; in the best case the cost can be equal or close to the non-nullable path. I'm not a C++ coder, but summarizing the discussion above, there are 2-3 fast paths missing at different levels of the hierarchy (a rough benchmark sketch follows the list):

1. Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading the 8M doubles themselves (this is suspicious, but might be correct).
2. The definition level data is all `0` and should be RLE encoded, so we might want to skip expanding it entirely (perhaps decode the column as non-nullable, since checking every value is unnecessary?).
3. The repetition level data is a vector of `0` followed by 79x `1` per row in our case. I'm not sure RLE helps much here; it sounds like an unnecessarily complex structure for fixed-size lists. On the other hand, reading or decoding it could be skipped, since it can be derived from the metadata.
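
A minimal pyarrow sketch of the comparison in point 1, assuming 100k rows of 80-element fixed-size lists of doubles (8M values total); the file names and the exact shape are illustrative assumptions, not taken from the issue:

```python
# Rough sketch: compare reading 8M doubles as a flat column vs. the same
# values wrapped in an 80-element fixed-size list (100k rows), to observe the
# overhead of decoding repetition/definition levels. Shapes and file names
# are illustrative assumptions.
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

ROWS, LIST_SIZE = 100_000, 80
values = pa.array(np.random.rand(ROWS * LIST_SIZE))

# Flat column: 8M doubles, no nested rep/def levels to expand.
pq.write_table(pa.table({"flat": values}), "flat.parquet")

# Fixed-size list column: same 8M doubles, but the reader must also process
# repetition levels of the form [0, 1, 1, ..., 1] (one 0 plus 79 ones per row).
fsl = pa.FixedSizeListArray.from_arrays(values, LIST_SIZE)
pq.write_table(pa.table({"fixed": fsl}), "fixed.parquet")

for name in ("flat", "fixed"):
    start = time.perf_counter()
    pq.read_table(f"{name}.parquet")
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```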
