Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/8860 )

Change subject: design-docs: improve cfile.md
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8860/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8860/1//COMMIT_MSG@19
PS1, Line 19: attrocious
nit: atrocious


http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md
File docs/design-docs/cfile.md:

http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md@76
PS1, Line 76: How big is a data block in bytes and row count, typically?
            : - How do we decide when a data block is full (by data size, by # of values, ...)?
            : - For Prefix encoding, how many restart points can we expect to be in a single
            :   block?
> - cfile block size is determined by the cfile_default_block_size flag (min
It'd also be nice to clarify the behavior if a value overflows the buffer. Do we extend the buffer to fit it or truncate the buffer and put the value first in the next cblock? What happens if a single cell value is too big for a block?


http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md@86
PS1, Line 86: group-varint coded
> Not sure if group-varint encoding is also deprecated for this?
It's not.


http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md@100
PS1, Line 100: restart point" which is necessary for
             : faster binary searching.
> A bit more explanation on how it is related to faster binary searching?
Without restarts, the nth value in the block has to be computed from values 1..(n - 1), so binary searching into the block is not possible without decoding all previous values. With restart points, binary searching can find the largest restart point <= the desired value, and decode forward from there.


http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md@133
PS1, Line 133: TODO(dan): No discussion of dictionary encoding, and the associated dictionary
             : block.
> Yeah, I think it would be useful to link the more detail doc on .h
+1 to leaving out the details of encodings here and just referring elsewhere.


http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md@199
PS1, Line 199: my best guess is
fwiw, I agree with your guess


http://gerrit.cloudera.org:8080/#/c/8860/1/docs/design-docs/cfile.md@222
PS1, Line 222: queries like: "seek to the data block
             : containing the Nth entry in this CFile".
> Should we add some insight on from which layer these queries are issued?
+1. I'm thinking it'd be used to skip forward to index i in CFiles for non-primary key columns when the value index was used to skip forward and ended up at index i?


--
To view, visit http://gerrit.cloudera.org:8080/8860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I770028bba3f7a49c96f32893c285221c84be39ce
Gerrit-Change-Number: 8860
Gerrit-PatchSet: 1
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Wed, 20 Dec 2017 21:52:15 +0000
Gerrit-HasComments: Yes
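
Illustration for the restart-point comment on cfile.md line 100: a minimal sketch of why restart points make binary search possible in a prefix-encoded block. This is not Kudu's on-disk layout or its C++ code. The names `encode`, `seek_at_or_after`, and `restart_interval`, and the in-memory list-of-tuples representation, are all hypothetical and chosen only to show the idea.

```python
def shared_prefix_len(a, b):
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode(values, restart_interval):
    """Prefix-encode sorted byte strings (hypothetical layout, for illustration).

    Each entry is (shared_len, suffix) relative to the previous value.
    Every restart_interval-th entry is a restart point: shared_len is 0,
    so its suffix is the full value and can be read without prior context.
    """
    entries, restarts = [], []
    prev = b""
    for i, value in enumerate(values):
        if i % restart_interval == 0:
            shared = 0
            restarts.append(i)
        else:
            shared = shared_prefix_len(prev, value)
        entries.append((shared, value[shared:]))
        prev = value
    return entries, restarts


def seek_at_or_after(entries, restarts, target):
    """Index of the first value >= target, or len(entries) if there is none."""
    # Binary search over restart points only: they are the only entries
    # that can be compared without decoding everything before them.
    lo, hi = 0, len(restarts)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[restarts[mid]][1] <= target:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return 0  # target sorts before the first value (or block is empty)

    # Decode forward from the largest restart point <= target.
    i = restarts[lo - 1]
    value = b""
    while i < len(entries):
        shared, suffix = entries[i]
        value = value[:shared] + suffix
        if value >= target:
            return i
        i += 1
    return len(entries)


values = [b"apple", b"applesauce", b"apply",
          b"banana", b"band", b"bandana", b"cat"]
entries, restarts = encode(values, restart_interval=3)
print(seek_at_or_after(entries, restarts, b"band"))
```

In this sketch, a seek decodes at most `restart_interval` entries after the binary search, rather than every entry from the start of the block. A smaller interval means less decoding per seek but less prefix compression.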
