[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking

Tim Armstrong (Code Review) Mon, 10 Oct 2016 14:29:07 -0700

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4123: Fast bit unpacking
......................................................................



Patch Set 7:

(28 comments)

http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/benchmarks/bit-packing-benchmark.cc
File be/src/benchmarks/bit-packing-benchmark.cc:

PS7, Line 330: int argc, char **argv
> unused
It seems weird to me to omit the arguments to main()


PS7, Line 336: int64_t
> const, or just make this the param to the data constructor and use data.siz
Done


PS7, Line 343:     suite.AddBenchmark(Substitute("UnpackScalar", bit_width), 
UnpackBenchmark, &params);
> This line at bit_width 32 is not in the output you display in the comment a
Oops, fat finger error


http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc
File be/src/util/bit-packing-test.cc:

PS7, Line 35:  a subset of
            : /// 'num_in_values'
> I still find this wording confusing - Which is the to buffer, and which the
I reworded to make it clearer that it's a subarray.


PS7, Line 42: misaligned
> When misalignment is 0, what alignment is the memory guaranteed to have? 0 
In practice it should be 64-bit aligned with TCMalloc beyond a certain 
allocation size. I switched to using posix_memalign so that it's explicitly 
guaranteed.


PS7, Line 49: uint32_t
> const
Done


Line 93:     uint32_t* in, uint8_t* packed, int num_in_values, int bit_width, 
int misalignment) {
> const T*
Done


PS7, Line 94: int
> const
Done


PS7, Line 99: vector
> std::aligned_storage?
Ended up adding a helper class.


PS7, Line 102:  sizeof(uint32_t)
> Why is this needed?
Fixed. It really should have been uint8_t.

It turned out this was masking a bug where we overran the buffer with 64-bit 
loads for odd bitwidths. I ended up just switching to doing 32-bit loads 
always. There wasn't a significant performance hit (if anything it looked a bit 
more consistent).


PS7, Line 132: min(length, 1)
> so the only misalignments allowed are 0 and 1? Why so restrictive?
Other ones should work but I don't think improve coverage - can't see a 
scenario where it would work for 1 byte-aligned and 8 byte-aligned but not 
2-byte or 4-byte aligned.


http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.h
File be/src/util/bit-packing.h:

Line 22: #ifndef IMPALA_UTIL_BIT_PACKING_H
> These usually go above the #includes
Done


PS7, Line 30: values
> 'value'
Done


PS7, Line 48: BitPacking
> Why not put this into use in this patch? In other words, why only the micro
The BitReader need some rework to be able to make use of the batched interface 
- I thought it would be better to break it up into smaller patches. It also 
makes it easier to compare again the older implementation before reworking that.


PS7, Line 67:  bits of data
> You can elide this phrase
Done


http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h
File be/src/util/bit-packing.inline.h:

PS4, Line 125:     const uint32_t trailing_bits = in_words[IN_WORD_IDX + 1] % 
> I'm not so sure. Let's consider reading 10 bit ints using 32-bit loads. To 
That doesn't seem right - the nonoverlapping aligned loads will always minimise 
the number of loads required. If you have a series of overlapping unaligned 
loads if you shift each load along by enough bytes that it doesn't overlap with 
the previous you get a series of nonoverlapping aligned loads that read all the 
bits you need. 

For batches of 32, the input is always a multiple of 32 bits, so the aligned 
loads always exactly cover the input bytes. So you can't shift one of the loads 
without creating a gap in the input array that you need to add another load to 
read.

The example has a bug somewhere - those loads only get you bits 0 to 112, which 
only contains the first 11 values.  Reading 11 10-bit values would require 4 
aligned loads, so that might be a case where you could slightly improve things 
by using unaligned loads. It does depend on the # of input values though so 
it's unclear how you'd exploit the # of input values without adding runtime 
overhead.

Anyway, from what I've seen it would be more impactful to put later 
optimisation effort into adding AVX2, which could be 4x faster.


http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h
File be/src/util/bit-packing.inline.h:

PS7, Line 45: make_
> make_ can be omitted.
Done


PS7, Line 57: const
> constexpr BATCH_SIZE
Done


PS7, Line 69:     in_bytes -= batch_size * (BIT_WIDTH / CHAR_BIT);
> so in_bytes does not move if BIT_WIDTH is 7?
Yep, that was a bug.


PS7, Line 90: onstants
> constants
Done


PS7, Line 91: should
> Should this be "must", since  VALUE_IDX is now a template param? Not that t
Yes, that's true - updated.


PS7, Line 115: else
> if branch returns, so you can drop the "else" here.
I found it clearer to have the if/else if/else just to make it obvious that the 
three branches are mutually exclusive. Normally I'd use the early return but 
here with only one level of indentation I didn't see a big benefit in 
readability.


Line 120:     DCHECK_LT(VALUE_IDX, 31) << "Trailing bits after last word.";
> Can you explain more why VALUE_IDX < 31? Can you add a static_assert at the
VALUE_IDX < 32 generally (added that static_assert) but this branch shouldn't 
be taken for the last value. Changed the message so that it's a little more 
explicit that the branch shouldn't be taken.


PS7, Line 126: warning
> What template values cause that warning to trigger?
BIT_WIDTH == 32, VALUE_IDX = *
  NUM_TRAILING_BITS == 0 (since we're not really unpacking, just reading 32-bit 
values)
  (BIT_WIDTH - NUM_TRAILING_BITS) == 32

The branch isn't actually taken in that case, but the warning apparently isn't 
able to reason about control flow well enough to know that.


PS7, Line 178: const
> constexpr MAX_BATCH_SIZE
Done


PS7, Line 179: int
> const
Done


PS7, Line 183:   // Copy into padded temporary buffer to avoid reading past the 
end of 'in'.
> Can this be avoided sometimes? For instance, if BIT_WIDTH is 6 and num_valu
It's avoidable if roundup(bytes_to_read, sizeof(uint32_t)) <= in_bytes. Added 
this check to decide whether to do the copy or not.


PS7, Line 184: BIT_WIDTH
> if BIT_WIDTH is 0, I don't think this is technically allowed.
Made sure it's 1 byte in that case.


-- 
To view, visit http://gerrit.cloudera.org:8080/4494
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking

Reply via email to