[Impala-ASF-CR] Wrote the decoder and encoder for byte stream split encoding.

Gabriella Gyorgyevics (Code Review) Fri, 03 Jan 2025 02:18:27 -0800

Gabriella Gyorgyevics has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22165 )


Change subject: Wrote the decoder and encoder for byte stream split encoding.
......................................................................


Patch Set 9:

(29 comments)

Thanks, I fixed them all. I also have a question in encoder.cc

http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h
File be/src/exec/parquet/parquet-byte-stream-split-decoder.h:

http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@24
PS9, Line 24: // This class can be used to decode byte-stream-split encoded 
values from a buffer. To
> Could add a link to https://github.com/apache/parquet-format/blob/master/En
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@35
PS9, Line 35:   // The data pointed to by `input_buffer` is the first byte of 
encoded data.
> I don't think this line is needed, instead, in the next line we could say "
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@37
PS9, Line 37: size given in the
            :   // constructor
> Here and also later in other comments (also in the encoder), it could be ea
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@66
PS9, Line 66: =
> To avoid confusion, I think it's better to use "==" even in comments.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.h@92
PS9, Line 92: o
> Nit: space.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc
File be/src/exec/parquet/parquet-byte-stream-split-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@30
PS9, Line 30:   if (input_buffer_len < 0) {
> I think this should be checked before the one on L26 - if a negative number
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@50
PS9, Line 50:   if (input_buffer_len_ == 0) return 0;
> Do we actually need this check? Won't the next one also handle this case?
It was originally here to check whether we can return 0 instead of running 
through the whole function, but now the whole function is pretty short so I 
removed it in both places, thanks.


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h
File be/src/exec/parquet/parquet-byte-stream-split-encoder.h:

http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@37
PS9, Line 37: data
> "of the output buffer" would be better.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@38
PS9, Line 38: where
> Something is missing, maybe "... is where ...".
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@48
PS9, Line 48: *
> Other encoders take the parameter by value or const reference. Impala also
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc
File be/src/exec/parquet/parquet-byte-stream-split-encoder.cc:

http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@31
PS9, Line 31:   if (output_buffer_len < 0) {
> I think this should be checked first.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@49
PS9, Line 49:   if(byte_list_.size() > output_buffer_len_) { return -1; }
> This could be converted to a DCHECK as it is a bug in the code if it happen
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@63
PS9, Line 63: byte_list_.clear();
Is `byte_list_.clear()` the best function to use here?

It doesn't change the capacity to 0, so the space it reserved for a big data 
load will stay reserved after this as well. So for example if we encoded 1000 
values in one go and only 10 in the next, most of its capacity wouldn't be used.

On the other hand, if we completely reset it (with capacity = 0), it will have 
to reallocate every time we reset the encoder. For example if we encode 1000 
values first, it had to reallocate 11 times, then if we want to encode 1000 
values again, it would have to reallocate 11 times again.

Which would be more effective?

There is also an option to reserve the vectors capacity in `NewPage()` to 
`output_buffer_len_`, which would mean that it can store exactly as many values 
as the buffer would be able to. This way it would never need to reallocate, and 
assuming that the buffer the same size as the number of values the user wants 
to encode (question is can we assume that??), the vector would also be filled 
up completely and no reserved space would be wasted.


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@61
PS9, Line 61: output_buffer_ = nullptr;
            :   output_buffer_len_ = 0;
            :   byte_list_.clear();
> These could be extracted into a Reset() function, which could also be calle
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc
File be/src/exec/parquet/parquet-byte-stream-split-test.cc:

http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@66
PS9, Line 66: basic_func
> This could have a more meaningful name, for example something like "test_in
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@67
PS9, Line 67: int
> Could be constexpr, also in the other functions.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@71
PS9, Line 71: 3
> The constant 3 should be extracted. It could also be initialised to "buffer
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@116
PS9, Line 116: buffer
> "input" or "input_buffer" would be more descriptive, also for the other fun
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@121
PS9, Line 121: sizeof(T)
> Could use 'byte_size' here.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@121
PS9, Line 121: l
> I think it should be plural ("expectedNumerals").
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@140
PS9, Line 140: //
> Nit: add two spaces of intentation.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@144
PS9, Line 144: decoder.GetTotalValueCount()
> The decoder hasn't been initialised at this point, so this will return 0.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@187
PS9, Line 187:  + byte_size
> Here we are reading past the end of the vector's data, this is undefined be
I think something got mixed up here, `expected_subset` is not needed at all. I 
removed it, thank you.


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@196
PS9, Line 196: decode_stride
> "decode_with_stride" would be better.
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@252
PS9, Line 252: 0
> This should be 'check_against', shouldn't it?
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@276
PS9, Line 276: Basic Functionality Tests
> Shouldn't it be "Decoding single values" or something similar?
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@326
PS9, Line 326: int64
> Should be "int64_t".
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@415
PS9, Line 415:     // this is needed, because the expected vector has the 
values scattered
> If we're encoding only one value, the output should be the same as the inpu
Done


http://gerrit.cloudera.org:8080/#/c/22165/9/be/src/exec/parquet/parquet-byte-stream-split-test.cc@455
PS9, Line 455: compareCount
> Could use memcmp instead (see https://en.cppreference.com/w/cpp/string/byte
Done



--
To view, visit http://gerrit.cloudera.org:8080/22165
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icea60894ae22b8ddb7616aeda6d69012cc69972c
Gerrit-Change-Number: 22165
Gerrit-PatchSet: 9
Gerrit-Owner: Gabriella Gyorgyevics <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabriella Gyorgyevics <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Comment-Date: Fri, 03 Jan 2025 10:13:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] Wrote the decoder and encoder for byte stream split encoding.

Reply via email to