Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/8267 )
Change subject: IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding ...................................................................... Patch Set 8: (18 comments) Thank you for reworking this code. I did a pass over the low level files and left mostly questions and minor suggestions. I still need to look at the integration into the column readers and the tests and benchmarks. http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.h File be/src/util/bit-packing.h: http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.h@67 PS8, Line 67: static std::pair<const uint8_t*, int64_t> UnpackValues(const uint8_t* __restrict__ in, Could this one be private? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.h@81 PS8, Line 81: static std::pair<const uint8_t*, int64_t> UnpackAndDecodeValues( Could this one be private? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.h@86 PS8, Line 86: /// Unpack exactly 32 values of 'bit_width' from 'in' to 'out'. 'in' must point to Is there a benefit of having those public over making them private and make the tests that use them a friend? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.h@114 PS8, Line 114: 'bit_width' Is this comment still correct? From looking at the implementation I also thought that this returns at most 31 values and that the caller has to make sure to set num_values accordingly. http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-packing.inline.h@56 PS8, Line 56: return UnpackValues<OutType, i>(in, in_bytes, num_values, out); nit: I think adding a blank line between the definition of the macro and calling it may increase readability. Feel free to ignore though. http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-stream-utils.h File be/src/util/bit-stream-utils.h: http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-stream-utils.h@33 PS8, Line 33: /// TODO: replace uses with the more efficient BatchedBitReader. Would that mean replacing the Writer with the BatchedReader? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-stream-utils.inline.h File be/src/util/bit-stream-utils.inline.h: http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/bit-stream-utils.inline.h@115 PS8, Line 115: auto result = BitPacking::UnpackAndDecodeValues(bit_width, buffer_pos_, You could use std::tie to unpack the result. http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@93 PS8, Line 93: uint32_t NextNumRepeats(); Could we convert all the uint32_t to int64_t while we're here? There's already a JIRA to do this and I think we generally try to avoid unsigned types unless there's a specific reason. http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@105 PS8, Line 105: /// copying the values to 'values'. Can you add a comment on the return value, here and below? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@134 PS8, Line 134: decoded nit: decode http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@137 PS8, Line 137: static constexpr int LITERAL_BUFFER_LEN = 32; out of curiosity, does this need constexpr instead of const? The latter should be sufficient to specify the array size below. http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@145 PS8, Line 145: exhaustive nit: exhausted? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@153 PS8, Line 153: bool FillLiteralBuffer() WARN_UNUSED_RESULT; Do we give any guarantees what the caller can do after this returns false? Should we mention that they have to call Reset() to get the object back into a usable state? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@159 PS8, Line 159: /// Output buffered literals and decrement 'literal_count_'. Should we mention that it also increments 'literal_buffer_pos_'? What is the return value? http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@494 PS8, Line 494: boundery nit: typo http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@498 PS8, Line 498: int num_read = bit_reader_.UnpackBatch(bit_width_, num_to_bypass, values + num_consumed); nit: long line http://gerrit.cloudera.org:8080/#/c/8267/8/be/src/util/rle-encoding.h@585 PS8, Line 585: if (UNLIKELY(literal_count_ == 0)) return; This line doesn't seem to have any effect? Is it there for perf reasons? If so, can you add a brief comment? http://gerrit.cloudera.org:8080/#/c/8267/7/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/8267/7/testdata/data/README@121 PS7, Line 121: write nit: writer -- To view, visit http://gerrit.cloudera.org:8080/8267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6 Gerrit-Change-Number: 8267 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Thu, 19 Oct 2017 23:04:15 +0000 Gerrit-HasComments: Yes
