jhorstmann commented on a change in pull request #8262:
URL: https://github.com/apache/arrow/pull/8262#discussion_r501769149
##########
File path: rust/arrow/src/buffer.rs
##########
@@ -369,120 +394,171 @@ where
result.freeze()
}
+/// Apply a bitwise operation `op` to two inputs and return the result as a Buffer.
+/// The inputs are treated as bitmaps, meaning that offsets and length are specified in number of bits.
fn bitwise_bin_op_helper<F>(
left: &Buffer,
- left_offset: usize,
+ left_offset_in_bits: usize,
right: &Buffer,
- right_offset: usize,
- len: usize,
+ right_offset_in_bits: usize,
+ len_in_bits: usize,
op: F,
) -> Buffer
where
- F: Fn(u8, u8) -> u8,
+ F: Fn(u64, u64) -> u64,
{
- let mut result = MutableBuffer::new(len).with_bitset(len, false);
+ // reserve capacity and set length so we can get a typed view of u64 chunks
+    let mut result =
+        MutableBuffer::new(ceil(len_in_bits, 8)).with_bitset(len_in_bits / 64 * 8, false);
- result
- .data_mut()
- .iter_mut()
- .zip(
- left.data()[left_offset..]
- .iter()
- .zip(right.data()[right_offset..].iter()),
- )
+ let left_chunks = left.bit_chunks(left_offset_in_bits, len_in_bits);
+ let right_chunks = right.bit_chunks(right_offset_in_bits, len_in_bits);
+ let result_chunks = result.typed_data_mut::<u64>().iter_mut();
+
+ result_chunks
+ .zip(left_chunks.iter().zip(right_chunks.iter()))
Review comment:
Yes, that sounds correct. The result buffer is newly created and
properly aligned for the largest primitive types.
The `with_bitset` call sets the len of the buffer to a multiple of 8 bytes / 64 bits, since otherwise the `typed_data_mut` function would complain. The capacity of the buffer is nevertheless large enough to also hold the remainder bytes.
There are still some comparison opcodes left in the assembly; I couldn't find a way to convince the compiler that all iterators have the same length. Even so, it seems to be not much slower than the SIMD version.
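To make the chunk/remainder split concrete, here is a minimal, self-contained sketch of the same pattern using plain `u64` slices instead of Arrow's `Buffer`/`MutableBuffer` (ignoring bit offsets); the function name and example values are made up for illustration:

```rust
/// AND two bitmaps given as u64 words, where only the lowest `len_in_bits`
/// bits are meaningful. Whole 64-bit words are combined one word at a time;
/// the trailing partial word is masked so bits past `len_in_bits` stay zero.
fn bitwise_and_bitmaps(left: &[u64], right: &[u64], len_in_bits: usize) -> Vec<u64> {
    let full_chunks = len_in_bits / 64;
    let remainder_bits = len_in_bits % 64;

    // Full 64-bit chunks: one AND per word.
    let mut result: Vec<u64> = left[..full_chunks]
        .iter()
        .zip(&right[..full_chunks])
        .map(|(l, r)| l & r)
        .collect();

    // Remainder: combine the last partial word and mask off the unused bits.
    if remainder_bits > 0 {
        let mask = (1u64 << remainder_bits) - 1;
        result.push((left[full_chunks] & right[full_chunks]) & mask);
    }
    result
}

fn main() {
    // 70 bits: one full 64-bit chunk plus 6 remainder bits.
    let left = vec![u64::MAX, 0b111111];
    let right = vec![0xAAAA_AAAA_AAAA_AAAA, 0b101010];
    assert_eq!(
        bitwise_and_bitmaps(&left, &right, 70),
        vec![0xAAAA_AAAA_AAAA_AAAA, 0b101010]
    );
}
```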
##########
File path: rust/arrow/src/buffer.rs
##########
@@ -369,120 +394,171 @@ where
result.freeze()
}
+/// Apply a bitwise operation `op` to two inputs and return the result as a Buffer.
+/// The inputs are treated as bitmaps, meaning that offsets and length are specified in number of bits.
fn bitwise_bin_op_helper<F>(
left: &Buffer,
- left_offset: usize,
+ left_offset_in_bits: usize,
right: &Buffer,
- right_offset: usize,
- len: usize,
+ right_offset_in_bits: usize,
+ len_in_bits: usize,
op: F,
) -> Buffer
where
- F: Fn(u8, u8) -> u8,
+ F: Fn(u64, u64) -> u64,
{
- let mut result = MutableBuffer::new(len).with_bitset(len, false);
+ // reserve capacity and set length so we can get a typed view of u64 chunks
+    let mut result =
+        MutableBuffer::new(ceil(len_in_bits, 8)).with_bitset(len_in_bits / 64 * 8, false);
- result
- .data_mut()
- .iter_mut()
- .zip(
- left.data()[left_offset..]
- .iter()
- .zip(right.data()[right_offset..].iter()),
- )
+ let left_chunks = left.bit_chunks(left_offset_in_bits, len_in_bits);
+ let right_chunks = right.bit_chunks(right_offset_in_bits, len_in_bits);
+ let result_chunks = result.typed_data_mut::<u64>().iter_mut();
+
+ result_chunks
+ .zip(left_chunks.iter().zip(right_chunks.iter()))
Review comment:
I don't really have fancy tooling for that. For isolated functions, the
[compiler explorer][1] is very useful. When looking at benchmarks with bigger
dependencies, I use the `objdump` command, which I think comes with Linux by
default. So I run a benchmark, note the executable it is running, and then do
something like
```
objdump -M intel -S target/release/deps/buffer_bit_ops-a5cda2cafee5a9f9 | less
```
Then I try to find my method in there. Sometimes it helps to give the
method a unique name (`sum` for a kernel was hard to find) and to mark the
method I'm interested in as `#[inline(never)]`. Reading and understanding the
assembly is then the tricky part. I usually know which instruction sequence to
expect in the innermost loop; here, for example, a combination of left shift /
right shift (from the bit slice iterator) and a binary AND.
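For illustration, a hypothetical stand-alone kernel showing that trick (the name and body are made up for this example; the point is only that a unique symbol plus `#[inline(never)]` is easy to locate in the `objdump` output):

```rust
/// Deliberately verbose, unique name so the symbol stands out in the disassembly.
/// `#[inline(never)]` keeps the function from being inlined into the benchmark loop.
#[inline(never)]
pub fn my_uniquely_named_bitwise_and_kernel(left: &[u64], right: &[u64], out: &mut [u64]) {
    for ((o, l), r) in out.iter_mut().zip(left.iter()).zip(right.iter()) {
        // The innermost loop should compile down to a plain AND per word
        // (plus loop bookkeeping and any leftover bounds/length comparisons).
        *o = l & r;
    }
}
```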
Vector instructions are usually a bit easier to spot since they are not that
common. A good heuristic is also that simpler and shorter instruction sequences
tend to be faster.
I haven't tried it yet, but a reverse engineering tool like [Ghidra][2]
could be useful to more easily analyze loops.
[1]: https://rust.godbolt.org/
[2]: https://ghidra-sre.org/