Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4123: Fast bit unpacking
......................................................................


Patch Set 5:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/benchmarks/bit-packing-benchmark.cc
File be/src/benchmarks/bit-packing-benchmark.cc:

PS4, Line 296: bool
> const
Done


PS4, Line 303: UnpackValues
> Unpack32Values
Done


PS4, Line 312: + offset
> This could go off the end of out_buffer if NUM_OUT_VALUES is not a multiple
Yeah, that's an assumption to simplify the benchmark code. Added a 
static_assert up the top.


PS4, Line 322: int64_t
> const
Done


http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing-test.cc
File be/src/util/bit-packing-test.cc:

PS4, Line 30: compute_mask
> CamelCaseName
Done


PS4, Line 35: smaller
> smaller than what?
clarified.


Line 116:   const int num_in_values = 64 * 1024;
> constexpr, and all caps name
Done


Line 119:     in[i] = rand();
> std::generate
Done


http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h
File be/src/util/bit-packing.h:

PS2, Line 57: ast one return
> I don't understand yet why you have to choose a batch size of 8 for a bit w
8 is the smallest number such that (3 * 8) % 8 == 0. I.e. where the last bit of 
the batch aligns to a byte boundary.


http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.h
File be/src/util/bit-packing.h:

PS4, Line 66: bytes
> bits?
Done


PS4, Line 66: values
> values of type OutType
Done


PS4, Line 70: in_bytes
> Can you add to the comment a description of how this parameter is used?
Ended up rewording the comment a bit to fit this in.


http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.inline.h
File be/src/util/bit-packing.inline.h:

PS3, Line 143: undef 
> https://gcc.gnu.org/onlinedocs/gcc/Push_002fPop-Macro-Pragmas.html#Push_002
got you - done.


http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h
File be/src/util/bit-packing.inline.h:

PS4, Line 125:     return lower_bits | (trailing_bits << TRAILING_BITS_SHIFT);
> This comes out to two shifts, a bitwise or, and a bitwise and. If you did u
That's true, but there's a downside - if the loads are aligned relative to the 
start of the buffer, the optimiser can reuse the results of the loads. Doing 
the unaligned load adds an extra load, which is probably comparable in cost to 
the two ALU ops that were saved.


Line 142:   BOOST_PP_REPEAT_FROM_TO(0, 33, UNPACK_VALUE_CALL, ignore);
> So, even after turning this into a templated function, a loop with static b
Even with #pragma unroll and ALWAYS_INLINE.

Inlining happens but the generated code is bad. I think either the unrolling or 
inlining happens too late in GCC's optimisation pipeline for the rest of the 
optimisation to be fully effective. I think part of it is that we have to 
demote the template argument VALUE_IDX to an argument, so the template doesn't 
fully specialise the code when it's instantiated.

Here's an example of the generated assembly. Note that there are a lot of jmps 
and jes in the loop version 
https://gist.github.com/timarmstrong/4f68a74054cfd0df91a89a824ed04961

I added a note in the UnpackValue() comment to explain how it should be stamped 
out.


-- 
To view, visit http://gerrit.cloudera.org:8080/4494
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbap...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to