Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking ......................................................................
Patch Set 5: (15 comments) http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS4, Line 296: bool > const Done PS4, Line 303: UnpackValues > Unpack32Values Done PS4, Line 312: + offset > This could go off the end of out_buffer if NUM_OUT_VALUES is not a multiple Yeah, that's an assumption to simplify the benchmark code. Added a static_assert up the top. PS4, Line 322: int64_t > const Done http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS4, Line 30: compute_mask > CamelCaseName Done PS4, Line 35: smaller > smaller than what? clarified. Line 116: const int num_in_values = 64 * 1024; > constexpr, and all caps name Done Line 119: in[i] = rand(); > std::generate Done http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS2, Line 57: ast one return > I don't understand yet why you have to choose a batch size of 8 for a bit w 8 is the smallest number such that (3 * 8) % 8 == 0. I.e. where the last bit of the batch aligns to a byte boundary. http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS4, Line 66: bytes > bits? Done PS4, Line 66: values > values of type OutType Done PS4, Line 70: in_bytes > Can you add to the comment a description of how this parameter is used? Ended up rewording the comment a bit to fit this in. http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS3, Line 143: undef > https://gcc.gnu.org/onlinedocs/gcc/Push_002fPop-Macro-Pragmas.html#Push_002 got you - done. http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS4, Line 125: return lower_bits | (trailing_bits << TRAILING_BITS_SHIFT); > This comes out to two shifts, a bitwise or, and a bitwise and. If you did u That's true, but there's a downside - if the loads are aligned relative to the start of the buffer, the optimiser can reuse the results of the loads. Doing the unaligned load adds an extra load, which is probably comparable in cost to the two ALU ops that were saved. Line 142: BOOST_PP_REPEAT_FROM_TO(0, 33, UNPACK_VALUE_CALL, ignore); > So, even after turning this into a templated function, a loop with static b Even with #pragma unroll and ALWAYS_INLINE. Inlining happens but the generated code is bad. I think either the unrolling or inlining happens too late in GCC's optimisation pipeline for the rest of the optimisation to be fully effective. I think part of it is that we have to demote the template argument VALUE_IDX to an argument, so the template doesn't fully specialise the code when it's instantiated. Here's an example of the generated assembly. Note that there are a lot of jmps and jes in the loop version https://gist.github.com/timarmstrong/4f68a74054cfd0df91a89a824ed04961 I added a note in the UnpackValue() comment to explain how it should be stamped out. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Jim Apple <jbap...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes