Gabriella Gyorgyevics has posted comments on this change. ( http://gerrit.cloudera.org:8080/22165 )
Change subject: Wrote the decoder and encoder for byte stream split encoding. ...................................................................... Patch Set 9: (29 comments) Thanks, I fixed them all. I also have a question in encoder.cc http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h File be/src/exec/parquet/parquet-byte-stream-split-decoder.h: http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@24 PS9, Line 24: // This class can be used to decode byte-stream-split encoded values from a buffer. To > Could add a link to https://github.com/apache/parquet-format/blob/master/En Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@35 PS9, Line 35: // The data pointed to by `input_buffer` is the first byte of encoded data. > I don't think this line is needed, instead, in the next line we could say " Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@37 PS9, Line 37: size given in the : // constructor > Here and also later in other comments (also in the encoder), it could be ea Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@66 PS9, Line 66: = > To avoid confusion, I think it's better to use "==" even in comments. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@92 PS9, Line 92: o > Nit: space. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc File be/src/exec/parquet/parquet-byte-stream-split-decoder.cc: http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@30 PS9, Line 30: if (input_buffer_len < 0) { > I think this should be checked before the one on L26 - if a negative number Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@50 PS9, Line 50: if (input_buffer_len_ == 0) return 0; > Do we actually need this check? Won't the next one also handle this case? It was originally here to check whether we can return 0 instead of running through the whole function, but now the whole function is pretty short so I removed it in both places, thanks. http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h File be/src/exec/parquet/parquet-byte-stream-split-encoder.h: http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@37 PS9, Line 37: data > "of the output buffer" would be better. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@38 PS9, Line 38: where > Something is missing, maybe "... is where ...". Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@48 PS9, Line 48: * > Other encoders take the parameter by value or const reference. Impala also Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc File be/src/exec/parquet/parquet-byte-stream-split-encoder.cc: http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@31 PS9, Line 31: if (output_buffer_len < 0) { > I think this should be checked first. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@49 PS9, Line 49: if(byte_list_.size() > output_buffer_len_) { return -1; } > This could be converted to a DCHECK as it is a bug in the code if it happen Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@63 PS9, Line 63: byte_list_.clear(); Is `byte_list_.clear()` the best function to use here? It doesn't change the capacity to 0, so the space it reserved for a big data load will stay reserved after this as well. So for example if we encoded 1000 values in one go and only 10 in the next, most of its capacity wouldn't be used. On the other hand, if we completely reset it (with capacity = 0), it will have to reallocate every time we reset the encoder. For example if we encode 1000 values first, it had to reallocate 11 times, then if we want to encode 1000 values again, it would have to reallocate 11 times again. Which would be more effective? There is also an option to reserve the vectors capacity in `NewPage()` to `output_buffer_len_`, which would mean that it can store exactly as many values as the buffer would be able to. This way it would never need to reallocate, and assuming that the buffer the same size as the number of values the user wants to encode (question is can we assume that??), the vector would also be filled up completely and no reserved space would be wasted. http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@61 PS9, Line 61: output_buffer_ = nullptr; : output_buffer_len_ = 0; : byte_list_.clear(); > These could be extracted into a Reset() function, which could also be calle Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc File be/src/exec/parquet/parquet-byte-stream-split-test.cc: http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@66 PS9, Line 66: basic_func > This could have a more meaningful name, for example something like "test_in Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@67 PS9, Line 67: int > Could be constexpr, also in the other functions. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@71 PS9, Line 71: 3 > The constant 3 should be extracted. It could also be initialised to "buffer Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@116 PS9, Line 116: buffer > "input" or "input_buffer" would be more descriptive, also for the other fun Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@121 PS9, Line 121: sizeof(T) > Could use 'byte_size' here. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@121 PS9, Line 121: l > I think it should be plural ("expectedNumerals"). Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@140 PS9, Line 140: // > Nit: add two spaces of intentation. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@144 PS9, Line 144: decoder.GetTotalValueCount() > The decoder hasn't been initialised at this point, so this will return 0. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@187 PS9, Line 187: + byte_size > Here we are reading past the end of the vector's data, this is undefined be I think something got mixed up here, `expected_subset` is not needed at all. I removed it, thank you. http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@196 PS9, Line 196: decode_stride > "decode_with_stride" would be better. Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@252 PS9, Line 252: 0 > This should be 'check_against', shouldn't it? Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@276 PS9, Line 276: Basic Functionality Tests > Shouldn't it be "Decoding single values" or something similar? Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@326 PS9, Line 326: int64 > Should be "int64_t". Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@415 PS9, Line 415: // this is needed, because the expected vector has the values scattered > If we're encoding only one value, the output should be the same as the inpu Done http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@455 PS9, Line 455: compareCount > Could use memcmp instead (see https://en.cppreference.com/w/cpp/string/byte Done -- To view, visit http://gerrit.cloudera.org:8080/22165 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icea60894ae22b8ddb7616aeda6d69012cc69972c Gerrit-Change-Number: 22165 Gerrit-PatchSet: 9 Gerrit-Owner: Gabriella Gyorgyevics <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabriella Gyorgyevics <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Comment-Date: Fri, 03 Jan 2025 10:13:34 +0000 Gerrit-HasComments: Yes
