Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22058 )

Change subject: WIP [docs] add information on nullable array data block
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/22058/1/docs/design-docs/cfile.md
File docs/design-docs/cfile.md:

http://gerrit.cloudera.org:8080/#/c/22058/1/docs/design-docs/cfile.md@151
PS1, Line 151: array start indices
> When storing the starting indices of each array, we might end up storing la
That's a good point.  I think we can consider this option for expressing the
coordinates of array cells as well.

Originally, I was thinking of expressing the offsets in "absolute" rather than
"relative" coordinates.  That seems better because it gives convenient access
to a particular cell in the flattened sequence (i.e. fetching the data of a
particular array): there is no need to walk through the coordinates of the
array cells from the very beginning to find the position in the flattened
sequence.
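
A minimal sketch of the access-cost difference described above, in Python for
brevity.  This is not Kudu code: the data and the names `flat`, `starts`,
`sizes`, `cell_by_starts` and `cell_by_sizes` are hypothetical and only
illustrate the idea.

```python
# Hypothetical illustration, not Kudu code: three array cells flattened
# into one sequence of elements.
flat = [10, 11, 20, 30, 31, 32]
starts = [0, 2, 3]   # "absolute" coordinates: start index of each cell
sizes = [2, 1, 3]    # "relative" coordinates: element count of each cell


def cell_by_starts(i):
    # Direct lookup: the end is the next cell's start (or the end of data).
    end = starts[i + 1] if i + 1 < len(starts) else len(flat)
    return flat[starts[i]:end]


def cell_by_sizes(i):
    # The start has to be accumulated over all preceding cells.
    start = sum(sizes[:i])
    return flat[start:start + sizes[i]]
```

Both functions return the same cell; the difference is only in how much of the
coordinate sequence has to be read to locate it.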

I recall Arrow uses a similar notation
(https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout)
in offset buffers for variable-sized lists, and I thought to apply a similar
approach here as well.  However, so far it seems using sizes instead of
indices/offsets should work in our case as well.
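
To illustrate why sizes could work as well: an Arrow-style offsets buffer
holds one more entry than there are lists and starts at 0, so it can be
rebuilt from the sizes in a single pass, and vice versa.  The sketch below
assumes that layout; `sizes_to_offsets` and `offsets_to_sizes` are
hypothetical helper names, not part of Arrow or Kudu.

```python
from itertools import accumulate


def sizes_to_offsets(sizes):
    # Arrow-style offsets: n + 1 entries, the first one is 0 and
    # entry j + 1 is the end of list j.
    return [0] + list(accumulate(sizes))


def offsets_to_sizes(offsets):
    # The size of list j is offsets[j + 1] - offsets[j].
    return [end - start for start, end in zip(offsets, offsets[1:])]
```

If a reader rebuilds the offsets once while decoding a block, it would still
get direct access to individual cells afterwards, at the cost of that one pass.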

BTW, there shouldn't be very large numbers for array indices with the default
limit for block size in a CFile of 256KiByte.  Since the values are
LEB128-encoded (7 payload bits per byte), a 4-byte integer index takes just 2
bytes after the encoding for indices below 16Ki, and 3 bytes for indices up to
64Ki.  Yes, that would be about 2 to 3 times more compared with storing array
sizes if the majority of arrays contain fewer than 128 elements.  In absolute
terms, for an extreme case of 10000 single-element arrays we are talking about
~10KiByte vs ~20KiByte.
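
The estimate for the extreme case can be checked with a short script.  This is
a sketch that assumes plain unsigned LEB128 (7 payload bits per byte) and does
not use the actual CFile encoder; `uleb128_len` is a hypothetical helper.

```python
def uleb128_len(value):
    """Number of bytes in the unsigned LEB128 encoding of a non-negative int."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


num_arrays = 10000

# Storing sizes: every array holds one element, so every size is 1.
sizes_bytes = sum(uleb128_len(1) for _ in range(num_arrays))

# Storing start indices: array i starts at index i in the flattened sequence.
starts_bytes = sum(uleb128_len(i) for i in range(num_arrays))

print(sizes_bytes, starts_bytes)  # prints "10000 19872"
```

Only the first 128 start indices fit into a single byte; the remaining 9872
take two bytes each, which gives the roughly twofold difference.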



--
To view, visit http://gerrit.cloudera.org:8080/22058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8972b3791d155e102240c80012e2b87192914cd1
Gerrit-Change-Number: 22058
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 13 Nov 2024 22:24:49 +0000
Gerrit-HasComments: Yes
