Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking ......................................................................
Patch Set 7: (28 comments) http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS7, Line 330: int argc, char **argv > unused It seems weird to me to omit the arguments to main() PS7, Line 336: int64_t > const, or just make this the param to the data constructor and use data.siz Done PS7, Line 343: suite.AddBenchmark(Substitute("UnpackScalar", bit_width), UnpackBenchmark, ¶ms); > This line at bit_width 32 is not in the output you display in the comment a Oops, fat finger error http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS7, Line 35: a subset of : /// 'num_in_values' > I still find this wording confusing - Which is the to buffer, and which the I reworded to make it clearer that it's a subarray. PS7, Line 42: misaligned > When misalignment is 0, what alignment is the memory guaranteed to have? 0 In practice it should be 64-bit aligned with TCMalloc beyond a certain allocation size. I switched to using posix_memalign so that it's explicitly guaranteed. PS7, Line 49: uint32_t > const Done Line 93: uint32_t* in, uint8_t* packed, int num_in_values, int bit_width, int misalignment) { > const T* Done PS7, Line 94: int > const Done PS7, Line 99: vector > std::aligned_storage? Ended up adding a helper class. PS7, Line 102: sizeof(uint32_t) > Why is this needed? Fixed. It really should have been uint8_t. It turned out this was masking a bug where we overran the buffer with 64-bit loads for odd bitwidths. I ended up just switching to doing 32-bit loads always. There wasn't a significant performance hit (if anything it looked a bit more consistent). PS7, Line 132: min(length, 1) > so the only misalignments allowed are 0 and 1? Why so restrictive? Other ones should work but I don't think improve coverage - can't see a scenario where it would work for 1 byte-aligned and 8 byte-aligned but not 2-byte or 4-byte aligned. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 22: #ifndef IMPALA_UTIL_BIT_PACKING_H > These usually go above the #includes Done PS7, Line 30: values > 'value' Done PS7, Line 48: BitPacking > Why not put this into use in this patch? In other words, why only the micro The BitReader need some rework to be able to make use of the batched interface - I thought it would be better to break it up into smaller patches. It also makes it easier to compare again the older implementation before reworking that. PS7, Line 67: bits of data > You can elide this phrase Done http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS4, Line 125: const uint32_t trailing_bits = in_words[IN_WORD_IDX + 1] % > I'm not so sure. Let's consider reading 10 bit ints using 32-bit loads. To That doesn't seem right - the nonoverlapping aligned loads will always minimise the number of loads required. If you have a series of overlapping unaligned loads if you shift each load along by enough bytes that it doesn't overlap with the previous you get a series of nonoverlapping aligned loads that read all the bits you need. For batches of 32, the input is always a multiple of 32 bits, so the aligned loads always exactly cover the input bytes. So you can't shift one of the loads without creating a gap in the input array that you need to add another load to read. The example has a bug somewhere - those loads only get you bits 0 to 112, which only contains the first 11 values. Reading 11 10-bit values would require 4 aligned loads, so that might be a case where you could slightly improve things by using unaligned loads. It does depend on the # of input values though so it's unclear how you'd exploit the # of input values without adding runtime overhead. Anyway, from what I've seen it would be more impactful to put later optimisation effort into adding AVX2, which could be 4x faster. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS7, Line 45: make_ > make_ can be omitted. Done PS7, Line 57: const > constexpr BATCH_SIZE Done PS7, Line 69: in_bytes -= batch_size * (BIT_WIDTH / CHAR_BIT); > so in_bytes does not move if BIT_WIDTH is 7? Yep, that was a bug. PS7, Line 90: onstants > constants Done PS7, Line 91: should > Should this be "must", since VALUE_IDX is now a template param? Not that t Yes, that's true - updated. PS7, Line 115: else > if branch returns, so you can drop the "else" here. I found it clearer to have the if/else if/else just to make it obvious that the three branches are mutually exclusive. Normally I'd use the early return but here with only one level of indentation I didn't see a big benefit in readability. Line 120: DCHECK_LT(VALUE_IDX, 31) << "Trailing bits after last word."; > Can you explain more why VALUE_IDX < 31? Can you add a static_assert at the VALUE_IDX < 32 generally (added that static_assert) but this branch shouldn't be taken for the last value. Changed the message so that it's a little more explicit that the branch shouldn't be taken. PS7, Line 126: warning > What template values cause that warning to trigger? BIT_WIDTH == 32, VALUE_IDX = * NUM_TRAILING_BITS == 0 (since we're not really unpacking, just reading 32-bit values) (BIT_WIDTH - NUM_TRAILING_BITS) == 32 The branch isn't actually taken in that case, but the warning apparently isn't able to reason about control flow well enough to know that. PS7, Line 178: const > constexpr MAX_BATCH_SIZE Done PS7, Line 179: int > const Done PS7, Line 183: // Copy into padded temporary buffer to avoid reading past the end of 'in'. > Can this be avoided sometimes? For instance, if BIT_WIDTH is 6 and num_valu It's avoidable if roundup(bytes_to_read, sizeof(uint32_t)) <= in_bytes. Added this check to decide whether to do the copy or not. PS7, Line 184: BIT_WIDTH > if BIT_WIDTH is 0, I don't think this is technically allowed. Made sure it's 1 byte in that case. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Jim Apple <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
