scovich commented on code in PR #9823:
URL: https://github.com/apache/arrow-rs/pull/9823#discussion_r3156355572


##########
arrow-array/src/array/boolean_array.rs:
##########
@@ -608,6 +608,62 @@ impl BooleanArray {
         }
     }
 
+    /// Returns a new [`BooleanArray`] of the same length where only the first
+    /// `n` non-null `true` positions remain `true`; any `true` positions
+    /// beyond the first `n` are replaced with `false`. The null buffer is
+    /// preserved unchanged.
+    ///
+    /// If this array has at most `n` non-null `true` values, `self` is
+    /// returned unchanged.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// # use arrow_array::BooleanArray;
+    /// let a = BooleanArray::from(vec![true, false, true, true, false, true]);
+    /// // Keep only the first 2 `true` positions; later trues become false.
+    /// let r = a.take_n_true(2);
+    /// assert_eq!(r, BooleanArray::from(vec![true, false, true, false, false, 
false]));
+    /// ```
+    pub fn take_n_true(self, n: usize) -> BooleanArray {
+        let len = self.len();
+        if n == 0 {
+            if !self.has_true() {
+                return self;
+            }
+            return BooleanArray::new(BooleanBuffer::new_unset(len), 
self.nulls);
+        }
+
+        // `set_indices` scans 64 bits at a time via `trailing_zeros`, so 
locating
+        // the n-th set bit is cheaper than visiting every bit. When a null 
buffer
+        // is present, skip set bits whose corresponding entry is null so only
+        // non-null trues count toward `n` (matching `true_count` semantics).
+        let (last_kept, has_more) = match self.nulls.as_ref() {
+            None => {
+                let mut iter = self.values.set_indices();
+                match iter.nth(n - 1) {
+                    Some(i) => (i, iter.next().is_some()),
+                    None => return self,
+                }

Review Comment:
   nit: consider doing
   ```suggestion
                   let Some(i) = iter.nth(n - 1) else {
                       return self;
                   };
                   (i, iter.next().is_some())
   ```



##########
arrow-array/src/array/boolean_array.rs:
##########
@@ -608,6 +608,62 @@ impl BooleanArray {
         }
     }
 
+    /// Returns a new [`BooleanArray`] of the same length where only the first
+    /// `n` non-null `true` positions remain `true`; any `true` positions
+    /// beyond the first `n` are replaced with `false`. The null buffer is
+    /// preserved unchanged.
+    ///
+    /// If this array has at most `n` non-null `true` values, `self` is
+    /// returned unchanged.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// # use arrow_array::BooleanArray;
+    /// let a = BooleanArray::from(vec![true, false, true, true, false, true]);
+    /// // Keep only the first 2 `true` positions; later trues become false.
+    /// let r = a.take_n_true(2);
+    /// assert_eq!(r, BooleanArray::from(vec![true, false, true, false, false, 
false]));
+    /// ```
+    pub fn take_n_true(self, n: usize) -> BooleanArray {
+        let len = self.len();
+        if n == 0 {
+            if !self.has_true() {

Review Comment:
   I'm not sure this early-return is worthwhile? 
[BooleanArray::has_true](https://docs.rs/arrow/latest/arrow/array/struct.BooleanArray.html#method.has_true)
 costs `O(n)`, not `O(1)`, and does a similar operation to the `set_indices` + 
`is_valid` filter code below? 
   
   How expensive would this corner case be, if it ran through the normal code 
below? 
   Seems like it would actually be pretty efficient?



##########
parquet/src/arrow/arrow_reader/read_plan.rs:
##########
@@ -409,35 +409,6 @@ impl LimitedReadPlanBuilder {
     }
 }
 
-/// Produce a new `BooleanArray` of the same length as `filter` in which only
-/// the first `n` `true` positions from `filter` remain `true`; any `true`
-/// positions beyond the first `n` are replaced with `false`.
-///
-/// `filter` must not contain nulls (callers apply [`prep_null_mask_filter`]
-/// first). If `filter` has at most `n` `true` values, a clone is returned.
-fn truncate_filter_after_n_trues(filter: BooleanArray, n: usize) -> 
BooleanArray {

Review Comment:
   Do callers still (need to) apply this `prep_null_mask_filter` function, when 
the new code handles nulls correctly?



##########
arrow-array/src/array/boolean_array.rs:
##########
@@ -608,6 +608,62 @@ impl BooleanArray {
         }
     }
 
+    /// Returns a new [`BooleanArray`] of the same length where only the first
+    /// `n` non-null `true` positions remain `true`; any `true` positions
+    /// beyond the first `n` are replaced with `false`. The null buffer is
+    /// preserved unchanged.
+    ///
+    /// If this array has at most `n` non-null `true` values, `self` is
+    /// returned unchanged.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// # use arrow_array::BooleanArray;
+    /// let a = BooleanArray::from(vec![true, false, true, true, false, true]);
+    /// // Keep only the first 2 `true` positions; later trues become false.
+    /// let r = a.take_n_true(2);
+    /// assert_eq!(r, BooleanArray::from(vec![true, false, true, false, false, 
false]));
+    /// ```
+    pub fn take_n_true(self, n: usize) -> BooleanArray {
+        let len = self.len();
+        if n == 0 {
+            if !self.has_true() {
+                return self;
+            }
+            return BooleanArray::new(BooleanBuffer::new_unset(len), 
self.nulls);
+        }
+
+        // `set_indices` scans 64 bits at a time via `trailing_zeros`, so 
locating
+        // the n-th set bit is cheaper than visiting every bit. When a null 
buffer
+        // is present, skip set bits whose corresponding entry is null so only
+        // non-null trues count toward `n` (matching `true_count` semantics).
+        let (last_kept, has_more) = match self.nulls.as_ref() {
+            None => {
+                let mut iter = self.values.set_indices();
+                match iter.nth(n - 1) {
+                    Some(i) => (i, iter.next().is_some()),
+                    None => return self,
+                }

Review Comment:
   Also, I wonder if we could consolidate the overall logic as:
   ```rust
   let mut iter = self.values.set_indices();
   let end = match self.nulls.as_ref() {
       Some(nulls) => iter.filter(|&i| nulls.is_valid(i)).nth(n),
       None => iter.nth(n),
   };
   let Some(end) = end else {
       return self;
   };
   
     ...
   builder.append_buffer(&self.values.slice(0, end));
   builder.append_n(len - end, false);
   ```
   Observation: `nth(n)` gives the index of the first `true` value we should 
discard (if any). We don't especially care where the last `true` value we 
should keep lives, because it's harmless for `append_buffer` to copy the 0+ 
intervening false/null values; the subsequent `append_n` call just has less 
work to do.



##########
parquet/src/arrow/arrow_reader/read_plan.rs:
##########
@@ -248,7 +248,7 @@ impl ReadPlanBuilder {
             match limit {
                 Some(limit) if matched_rows + filter.true_count() >= limit => {
                     let needed = limit - matched_rows;
-                    let truncated = truncate_filter_after_n_trues(filter, 
needed);
+                    let truncated = filter.take_n_true(needed);

Review Comment:
   This feels a bit redundant -- what goes wrong if we just call 
`filter.take_n_true(limit - matched_rows)` directly, and skip the `true_count` 
juggling that already pays `O(n)`?
   
   The missing piece is, we somehow need to know how many true values 
`take_n_true` actually found... but it seems just as cheap to call `true_count` 
on the `truncated` result, no?
   ```rust
   match limit {
       Some(limit) if limit - matched_rows < filter.len() => {
           // There is a limit, and the current batch has enough rows to 
possibly hit the limit
           let truncated = filter.take_n_true(limit - matched_rows);
           matched_rows += truncated.true_count();
           filters.push(truncated);
           if matched_rows >= limit {
               break;
           }
       }
       _ => {
           matched_rows += filter.true_count();
           filters.push(filter);
       }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to