jhorstmann commented on a change in pull request #8262:
URL: https://github.com/apache/arrow/pull/8262#discussion_r501769149



##########
File path: rust/arrow/src/buffer.rs
##########
@@ -369,120 +394,171 @@ where
     result.freeze()
 }
 
+/// Apply a bitwise operation `op` to two inputs and return the result as a 
Buffer.
+/// The inputs are treated as bitmaps, meaning that offsets and length are 
specified in number of bits.
 fn bitwise_bin_op_helper<F>(
     left: &Buffer,
-    left_offset: usize,
+    left_offset_in_bits: usize,
     right: &Buffer,
-    right_offset: usize,
-    len: usize,
+    right_offset_in_bits: usize,
+    len_in_bits: usize,
     op: F,
 ) -> Buffer
 where
-    F: Fn(u8, u8) -> u8,
+    F: Fn(u64, u64) -> u64,
 {
-    let mut result = MutableBuffer::new(len).with_bitset(len, false);
+    // reserve capacity and set length so we can get a typed view of u64 chunks
+    let mut result =
+        MutableBuffer::new(ceil(len_in_bits, 8)).with_bitset(len_in_bits / 64 
* 8, false);
 
-    result
-        .data_mut()
-        .iter_mut()
-        .zip(
-            left.data()[left_offset..]
-                .iter()
-                .zip(right.data()[right_offset..].iter()),
-        )
+    let left_chunks = left.bit_chunks(left_offset_in_bits, len_in_bits);
+    let right_chunks = right.bit_chunks(right_offset_in_bits, len_in_bits);
+    let result_chunks = result.typed_data_mut::<u64>().iter_mut();
+
+    result_chunks
+        .zip(left_chunks.iter().zip(right_chunks.iter()))

Review comment:
       Yes, that sounds correct. The result buffer is newly created and 
properly aligned for the largest primitive types.
   
   The `with_bitset` call sets the len of the buffer to multiple of 8bytes / 
64bits, as otherwise the `typed_data_mut` function would complain. The capacity 
of the buffer is however large enough to also contain the remainder bytes.
   
   There are some comparison opcodes left in the assembly, I couldn't find a 
way to convince the compiler that all iterators have the same length. Still it 
seems to be not much slower than the simd version.

##########
File path: rust/arrow/src/buffer.rs
##########
@@ -369,120 +394,171 @@ where
     result.freeze()
 }
 
+/// Apply a bitwise operation `op` to two inputs and return the result as a 
Buffer.
+/// The inputs are treated as bitmaps, meaning that offsets and length are 
specified in number of bits.
 fn bitwise_bin_op_helper<F>(
     left: &Buffer,
-    left_offset: usize,
+    left_offset_in_bits: usize,
     right: &Buffer,
-    right_offset: usize,
-    len: usize,
+    right_offset_in_bits: usize,
+    len_in_bits: usize,
     op: F,
 ) -> Buffer
 where
-    F: Fn(u8, u8) -> u8,
+    F: Fn(u64, u64) -> u64,
 {
-    let mut result = MutableBuffer::new(len).with_bitset(len, false);
+    // reserve capacity and set length so we can get a typed view of u64 chunks
+    let mut result =
+        MutableBuffer::new(ceil(len_in_bits, 8)).with_bitset(len_in_bits / 64 
* 8, false);
 
-    result
-        .data_mut()
-        .iter_mut()
-        .zip(
-            left.data()[left_offset..]
-                .iter()
-                .zip(right.data()[right_offset..].iter()),
-        )
+    let left_chunks = left.bit_chunks(left_offset_in_bits, len_in_bits);
+    let right_chunks = right.bit_chunks(right_offset_in_bits, len_in_bits);
+    let result_chunks = result.typed_data_mut::<u64>().iter_mut();
+
+    result_chunks
+        .zip(left_chunks.iter().zip(right_chunks.iter()))

Review comment:
       I don't really have fancy tooling for that. For isolated functions, the 
[compiler explorer][1] is very useful. When looking at benchmarks with bigger 
dependencies I use the `objdump` command which I think comes with linux by 
default. So I run a benchmark, notice the executable it is running and then do 
something like
   
   ```
   objdump -M intel -S target/release/deps/buffer_bit_ops-a5cda2cafee5a9f9 | 
less
   ```
   
   And then try to find my method in there. Sometimes it helps to give the 
method a unique name (`sum` for a kernel was hard to find) and to mark the 
method I'm interested in as `#inline(never)`. Reading and understanding the 
assembly then is the tricky part. I usually expect a certain instruction 
sequence in the innermost loop that I'm looking for, here for example a 
combination of left shift / right shift (from the bit slice iterator) and 
binary and.
   
   Vector instructions are usually a bit easier to spot since they are not that 
common. A good heuristic is also that simpler + shorter instruction sequences = 
faster.
   
   I haven't tried it yet, but a reverse engineering tool like [Ghidra][2] 
could be useful to more easily analyze loops.
   
    [1]: https://rust.godbolt.org/
    [2]: https://ghidra-sre.org/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to