[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9759: ARROW-12032: [Rust] Optimize comparison kernels using trusted_len iterator for bools

GitBox Sat, 20 Mar 2021 13:24:46 -0700


yordan-pavlov commented on a change in pull request #9759:
URL: https://github.com/apache/arrow/pull/9759#discussion_r598154125




##########
File path: rust/arrow/src/buffer/mutable.rs
##########
@@ -415,6 +415,61 @@ impl MutableBuffer {
         buffer
     }
 
+    /// Creates a [`MutableBuffer`] from a boolean [`Iterator`] with a trusted (upper) length.
+    /// # use arrow::buffer::MutableBuffer;
+    /// # Example
+    /// ```
+    /// # use arrow::buffer::MutableBuffer;
+    /// let v = vec![false, true, false];
+    /// let iter = v.iter().map(|x| *x || true);
+    /// let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(iter) };
+    /// assert_eq!(buffer.len(), 1) // 3 booleans have 1 byte
+    /// ```
+    /// # Safety
+    /// This method assumes that the iterator's size is correct and is undefined behavior
+    /// to use it on an iterator that reports an incorrect length.
+    // This implementation is required for two reasons:
+    // 1. there is no trait `TrustedLen` in stable rust and therefore
+    //    we can't specialize `extend` for `TrustedLen` like `Vec` does.
+    // 2. `from_trusted_len_iter_bool` is faster.
+    pub unsafe fn from_trusted_len_iter_bool<I: Iterator<Item = bool>>(
+        mut iterator: I,
+    ) -> Self {
+        let (_, upper) = iterator.size_hint();
+        let upper = upper.expect("from_trusted_len_iter requires an upper limit");
+
+        let mut result = {
+            let byte_capacity: usize = upper.saturating_add(7) / 8;
+            MutableBuffer::new(byte_capacity)
+        };
+
+        'a: loop {
+            let mut byte_accum: u8 = 0;
+            let mut mask: u8 = 1;
+
+            //collect (up to) 8 bits into a byte
+            while mask != 0 {
+                if let Some(value) = iterator.next() {
+                    byte_accum |= match value {

Review comment:
       I wonder if the bool iterator could be split into chunks of 8 bool
values (for example, using
https://docs.rs/itertools/0.4.2/itertools/struct.Chunks.html or, alternatively,
https://doc.rust-lang.org/std/primitive.slice.html#method.chunks). Each chunk
would then be mapped into a byte: convert each bool value into a byte (for
example, using std::mem::transmute::<bool, u8>), shift it according to its
position in the chunk, and OR it into the output byte. Finally, the resulting
byte iterator would be used to build the buffer directly. This is the fastest
implementation I can imagine, because it eliminates as many conditions / checks
as possible (and conditions are the enemy of fast).
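
A minimal sketch of the chunking idea above, under several assumptions: the
input is already a slice (so `slice::chunks` applies; an arbitrary iterator
would need something like the itertools adaptor instead), the result is
collected into a plain `Vec<u8>` rather than a `MutableBuffer`, and bits are
packed least-significant-bit first, matching the `mask` that starts at 1 in
the diff. The name `pack_bools` is hypothetical and not part of arrow. It uses
the safe cast `bit as u8` in place of `std::mem::transmute::<bool, u8>`; both
yield 0 or 1.

```rust
/// Hypothetical helper (not part of arrow): packs bools into bytes,
/// least-significant bit first. The last byte is zero-padded when the
/// input length is not a multiple of 8.
fn pack_bools(values: &[bool]) -> Vec<u8> {
    values
        // Split the input into groups of up to 8 bools.
        .chunks(8)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                // `bit as u8` is 0 or 1; shift it to its position in the
                // chunk and OR it into the output byte. `i` is always < 8.
                .fold(0u8, |byte, (i, &bit)| byte | ((bit as u8) << i))
        })
        .collect()
}
```

For example, `pack_bools(&[false, true, false])` returns a single byte with
only bit 1 set. Whether this is actually faster than the loop in the diff
would have to be confirmed with a benchmark.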




