Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/22058 )
Change subject: WIP [docs] add information on nullable array data block
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/22058/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22058/2//COMMIT_MSG@7
PS2, Line 7: array
> nit: This may have already been answered before, for my understanding - doe
No, it's not. Multi-dimensional arrays and other complex data structures require an additional layer (dealing with the so-called 'definition level') that's orthogonal to this one. Basically, this work allows for one-dimensional arrays and also provides the basis for the so-called 'repetition level' in the representation of nested data structures introduced in Dremel and used in other projects like Parquet and Arrow. This (and related parts) might be a good read to get a broader context: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/


http://gerrit.cloudera.org:8080/#/c/22058/2/docs/design-docs/cfile.md
File docs/design-docs/cfile.md:

http://gerrit.cloudera.org:8080/#/c/22058/2/docs/design-docs/cfile.md@156
PS2, Line 156: 1101111
> +1
There isn't any logic -- these bitmaps are completely independent.


http://gerrit.cloudera.org:8080/#/c/22058/2/docs/design-docs/cfile.md@156
PS2, Line 156: 1101111
The array bitmap and the flattened sequence bitmaps are completely independent.
> Why doesn't the array null bitmap reflect that as well?
The array bitmap provides information on the nullability of the arrays themselves, not of the elements in them. The bitmaps are independent -- that way it's much easier to interpret the contents. You can think of it like this: first, the full sequence is restored (with nulls) using the flattened bitmap. Then, using the array nullability bitmap and the array start indices, the array cells are restored from the sequence, which now contains the null elements as well.

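For illustration, the two-step restoration described above can be sketched in Python. This is only a minimal sketch, not the Kudu implementation: the function names and the sample input are hypothetical, bitmaps are modeled as strings of '0'/'1' read left to right, a '1' bit is assumed to mean non-null, and each array is assumed to end where the next one starts (the last one ends at the end of the restored sequence).

```python
from typing import List, Optional, Sequence


def restore_flat_sequence(values: Sequence[int],
                          null_bitmap: str) -> List[Optional[int]]:
    """Step 1: re-insert nulls into the flattened sequence.

    ASSUMPTION: a '1' bit marks a non-null element, '0' marks a null.
    """
    if null_bitmap.count("1") != len(values):
        raise ValueError("bitmap does not match the number of stored values")
    remaining = iter(values)
    return [next(remaining) if bit == "1" else None for bit in null_bitmap]


def restore_arrays(flat: Sequence[Optional[int]],
                   start_indices: Sequence[int],
                   array_bitmap: str) -> List[Optional[List[Optional[int]]]]:
    """Step 2: cut the restored sequence into array cells.

    ASSUMPTION: one start index per array; array i covers
    [start_indices[i], start_indices[i + 1]), and the last array
    runs to the end of the restored sequence.
    """
    if len(start_indices) != len(array_bitmap):
        raise ValueError("need exactly one bitmap bit per array")
    cells: List[Optional[List[Optional[int]]]] = []
    for i, start in enumerate(start_indices):
        is_last = i + 1 == len(start_indices)
        end = len(flat) if is_last else start_indices[i + 1]
        if array_bitmap[i] == "0":
            cells.append(None)  # the array cell itself is null
        else:
            cells.append(list(flat[start:end]))  # may be empty or hold nulls
    return cells


# Hypothetical input, made up for this sketch (not taken from cfile.md).
flat = restore_flat_sequence([1, 2, 4], "1101")
cells = restore_arrays(flat, [0, 2, 2, 2], "1011")
print(flat)   # [1, 2, None, 4]
print(cells)  # [[1, 2], None, [], [None, 4]]
```

Under these assumptions an empty array and a null array both cover an empty range of the sequence, and only the array bitmap tells them apart; the element-level bitmap is never consulted in the second step, which is what keeps the two bitmaps independent.
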
http://gerrit.cloudera.org:8080/#/c/22058/2/docs/design-docs/cfile.md@169
PS2, Line 169: 5,6,7,8
> Can array elements be in random sequence or non-ascending order?
Elements in an array can be in any order -- that's just how they are represented in array data blocks, but their representation is always deterministic as per the spec documented here.


http://gerrit.cloudera.org:8080/#/c/22058/2/docs/design-docs/cfile.md@166
PS2, Line 166: | [2, 2) | {} |
             : | [2, 2) | null |
             : | [2, 4) | { 3,4 } |
             : | [4, 8) | { 5,6,7,8 } |
             : | [8, 9) | { null } |
> It would help to add a one liner definition for these (sort of a notation s
Sure, I'll add this one even if it's easily deducible from the former example. As one can see, it would be:

| field | value in human readable format for illustration |
| --- | --- |
| flatten sequence | 3,4 |
| flatten value count | 2 |
| flatten null bitmap length | 3 |
| flatten null bitmap | 011 |
| array start indices length | 4 |
| array start indices | 0,0,0,0 |
| array null bitmap | 1011 |


-- 
To view, visit http://gerrit.cloudera.org:8080/22058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8972b3791d155e102240c80012e2b87192914cd1
Gerrit-Change-Number: 22058
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Comment-Date: Fri, 22 Nov 2024 18:59:57 +0000
Gerrit-HasComments: Yes
