Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/22058 )
Change subject: WIP [docs] add information on nullable array data block
......................................................................

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/22058/1/docs/design-docs/cfile.md
File docs/design-docs/cfile.md:

http://gerrit.cloudera.org:8080/#/c/22058/1/docs/design-docs/cfile.md@151
PS1, Line 151: array start indices
> When storing the starting indices of each array, we might end up storing la

That's a good point. I think we can consider this option to express the coordinates of array cells as well.

Originally, I was thinking of expressing the offsets in "absolute" rather than "relative" coordinates. That seems better because it gives convenient access to a particular cell in the flattened sequence (i.e. fetching the data of a particular array): there is no need to walk through the array cell coordinates from the very beginning to find the position in the flattened sequence. I recall Arrow uses a similar notation (https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout) in offset buffers for variable-sized lists, and I thought to apply a similar approach here as well. However, so far it seems that using sizes instead of indices/offsets should work in our case as well.

BTW, there shouldn't be very large numbers for array indices with the default limit for block size in a CFile of 256 KiByte. Since the values are LEB128-encoded, a 4-byte integer index takes just 2 bytes after the encoding for values up to 16383, and 3 bytes for indices close to 64Ki. Yes, that would be about 2 times more compared with storing array sizes if the majority of arrays contain fewer than 128 elements, so that each size fits into a single byte. In absolute terms, for an extreme case of 10000 single-element arrays we are talking about ~10 KiByte vs ~20 KiByte.
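
To make the trade-off above concrete, here is a minimal sketch that compares the encoded size of array sizes against absolute start offsets, and shows the difference in lookup cost. It assumes plain unsigned LEB128 (7 payload bits per byte); all helper names are hypothetical and are not taken from the Kudu code base or from the design doc.

```python
def uleb128_len(value: int) -> int:
    """Number of bytes unsigned LEB128 needs for a non-negative integer."""
    if value < 0:
        raise ValueError("unsigned LEB128 encodes non-negative integers only")
    n = 1
    while value >= 0x80:  # each encoded byte carries 7 payload bits
        value >>= 7
        n += 1
    return n


def start_offsets(sizes):
    """Absolute start offset of each array in the flattened sequence."""
    offsets, pos = [], 0
    for size in sizes:
        offsets.append(pos)
        pos += size
    return offsets


def encoded_bytes(values) -> int:
    """Total LEB128-encoded size of a sequence of integers."""
    return sum(uleb128_len(v) for v in values)


def cell_range_from_offsets(offsets, total, i):
    """O(1) lookup of array i as a [begin, end) range from absolute offsets.

    `total` is the number of elements in the flattened sequence; it closes
    the range of the last array.
    """
    end = offsets[i + 1] if i + 1 < len(offsets) else total
    return offsets[i], end


def cell_range_from_sizes(sizes, i):
    """O(i) lookup: the sizes of all preceding arrays are summed first."""
    begin = sum(sizes[:i])
    return begin, begin + sizes[i]


if __name__ == "__main__":
    # The extreme case from the comment: 10000 single-element arrays.
    sizes = [1] * 10000
    offsets = start_offsets(sizes)
    print(encoded_bytes(sizes), encoded_bytes(offsets))
```

For the 10000 single-element arrays the sketch reports 10000 bytes for sizes and 19872 bytes for offsets. Both lookup helpers return the same range for a given array; only the amount of work needed to find it differs.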
--
To view, visit http://gerrit.cloudera.org:8080/22058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8972b3791d155e102240c80012e2b87192914cd1
Gerrit-Change-Number: 22058
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 13 Nov 2024 22:24:49 +0000
Gerrit-HasComments: Yes
