Re: [PR] lazily compute for null count(seems help to high cardinality aggr) [arrow-rs]

via GitHub Fri, 09 Aug 2024 08:25:49 -0700


Rachelint commented on code in PR #6155:
URL: https://github.com/apache/arrow-rs/pull/6155#discussion_r1711670573



##########
arrow-buffer/src/buffer/null.rs:
##########
@@ -15,35 +15,66 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use std::sync::atomic::{AtomicI64, Ordering};
+
 use crate::bit_iterator::{BitIndexIterator, BitIterator, BitSliceIterator};
 use crate::buffer::BooleanBuffer;
 use crate::{Buffer, MutableBuffer};
 
+const UNINITIALIZED_NULL_COUNT: i64 = -1;
+
+#[derive(Debug)]
+pub enum NullCount {
+    Eager(usize),
+    Lazy(AtomicI64),

Review Comment:
   > I feel the state here is a bit complicated. We have three states: 
`Eager`, `Lazy (initialized)`, and `Lazy (uninitialized)`, and we use both an 
enum and a sentinel value to differentiate them.
   > 
   > I wonder if we can simplify this to just two states, uninitialized and 
initialized: when we read an uninitialized value, we count the nulls, store 
the result, and the value is initialized from then on.
   > 
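   > ```rust
   > struct NullCount {
   >   val: AtomicI64,
   > }
   > 
   > impl NullCount {
   >   fn get(&self, ...) -> i64 {
   >      let val = self.val.load(...);
   >      if val == UNINIT {
   >          // First read: count the nulls once and cache the result.
   >          let count = cal_null_count;
   >          self.val.store(count, ...);
   >          count
   >      } else {
   >          // Already counted: return the cached value.
   >          val
   >      }
   >   }
   > }
   > ```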
   > 
   > This way we only have two states to manage, and `NullCount` stays at 
8 bytes instead of the current 24 bytes, which might help with performance.
   
   I didn't want to make it this complicated either, and I did implement it 
with two states at the beginning.
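   
   For reference, the two-state idea can be sketched roughly as below. This is 
only an illustration, not the code from the first revision: `LazyNullCount`, 
`UNINIT`, the `compute` closure, and the choice of `Ordering::Relaxed` are all 
assumptions made for the sketch.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Sentinel meaning "not counted yet" (assumed name; the diff calls it
/// UNINITIALIZED_NULL_COUNT).
const UNINIT: i64 = -1;

/// Hypothetical two-state null count: either uninitialized or cached.
struct LazyNullCount {
    val: AtomicI64,
}

impl LazyNullCount {
    fn new() -> Self {
        Self { val: AtomicI64::new(UNINIT) }
    }

    /// Returns the cached count, calling `compute` only if nothing is cached.
    /// `compute` stands in for counting the unset bits of the validity buffer.
    fn get(&self, compute: impl FnOnce() -> usize) -> usize {
        let cached = self.val.load(Ordering::Relaxed);
        if cached != UNINIT {
            return cached as usize;
        }
        // Racing threads may both reach this point; they compute the same
        // count, so the duplicate store is harmless.
        let count = compute();
        self.val.store(count as i64, Ordering::Relaxed);
        count
    }
}

/// Small usage example: the second read is served from the cache.
fn two_state_demo() -> (usize, usize, u32) {
    let validity = [true, false, true, false, false];
    let nc = LazyNullCount::new();
    let mut calls = 0u32;
    let first = nc.get(|| {
        calls += 1;
        validity.iter().filter(|v| !**v).count()
    });
    let second = nc.get(|| {
        calls += 1;
        0
    });
    (first, second, calls)
}
```

   Every read in this version goes through an atomic load, including the case 
where the count was already known when the buffer was built.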
   
   But in the first version of this PR, that seemed to make some queries 
slower (maybe noise?), and I ran into a strange performance problem, possibly 
related to atomics, in another 
[PR](https://github.com/apache/datafusion/pull/11802) of mine. So, to avoid 
the possible cost of atomics as far as I can, I refactored it into this form.
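   
   To show what the enum buys, here is a minimal sketch of how a read could be 
dispatched over the two variants from the diff. The `get` method and its 
`compute` closure are hypothetical names for illustration and are not the 
accessor in this PR; only the enum and the constant come from the diff above.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

const UNINITIALIZED_NULL_COUNT: i64 = -1;

#[derive(Debug)]
pub enum NullCount {
    Eager(usize),
    Lazy(AtomicI64),
}

impl NullCount {
    /// Hypothetical accessor. `compute` stands in for counting the unset
    /// bits of the validity buffer.
    pub fn get(&self, compute: impl FnOnce() -> usize) -> usize {
        match self {
            // Count known at construction: a plain read, no atomic involved.
            NullCount::Eager(n) => *n,
            // Count deferred: check the sentinel, compute and cache on a miss.
            NullCount::Lazy(cell) => {
                let cached = cell.load(Ordering::Relaxed);
                if cached != UNINITIALIZED_NULL_COUNT {
                    return cached as usize;
                }
                let n = compute();
                cell.store(n as i64, Ordering::Relaxed);
                n
            }
        }
    }
}
```

   The trade-off is the one raised in the review: the `Eager` arm never touches 
an atomic, at the price of a third state and a larger type.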
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
