alamb commented on code in PR #8849:
URL: https://github.com/apache/arrow-datafusion/pull/8849#discussion_r1454134397
##########
benchmarks/queries/clickbench/README.md:
##########
@@ -28,6 +28,23 @@ SELECT
FROM hits;
```
+### Q1
+Query to test distinct count for String. All three columns are small strings (length either 1 or 2).
Review Comment:
I think these queries also fall under the data exploration category:
```suggestion
Models initial data exploration, to understand some statistics of the data.
Query to test distinct count for String. All three columns are small strings (length either 1 or 2).
```
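As an aside on why small (1-2 byte) strings are an interesting case for distinct counting: the accumulator in this PR uses an `SSOStringHashSet`. The following is a minimal std-only sketch of the general short-string-optimization idea (the names `SsoString` and `distinct_count` are hypothetical, not DataFusion's actual implementation): strings that fit a fixed inline buffer avoid a heap allocation per distinct value.

```rust
// Hypothetical sketch of short-string optimization for distinct counting.
// Not DataFusion's SSOStringHashSet: just the core idea that values like
// "GU" or "K" can be stored inline, while long values spill to the heap.
use std::collections::HashSet;

#[derive(Hash, PartialEq, Eq)]
enum SsoString {
    // small strings (e.g. HitColor, BrowserCountry) stored without allocating
    Inline { len: u8, buf: [u8; 8] },
    // fallback for longer values such as URLs
    Heap(String),
}

impl SsoString {
    fn new(s: &str) -> Self {
        let bytes = s.as_bytes();
        if bytes.len() <= 8 {
            let mut buf = [0u8; 8];
            buf[..bytes.len()].copy_from_slice(bytes);
            SsoString::Inline { len: bytes.len() as u8, buf }
        } else {
            SsoString::Heap(s.to_string())
        }
    }
}

fn distinct_count<'a>(values: impl Iterator<Item = &'a str>) -> usize {
    let mut set = HashSet::new();
    for v in values {
        set.insert(SsoString::new(v));
    }
    set.len()
}

fn main() {
    let colors = ["K", "G", "K", "Y", "G"];
    println!("{}", distinct_count(colors.iter().copied())); // prints 3
}
```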
##########
benchmarks/queries/clickbench/README.md:
##########
@@ -28,6 +28,23 @@ SELECT
FROM hits;
```
+### Q1
+Query to test distinct count for String. All three columns are small strings (length either 1 or 2).
+
+```sql
+SELECT
+    COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
+FROM hits;
+```
+
+### Q2
+Query to test distinct count for String. "URL" has length greater than 8
Review Comment:
Does this query test something significantly different than Q0 and Q1?
What do you think about maybe adding a GROUP BY as well?
```
❯ SELECT "BrowserCountry", COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM 'hits.parquet' GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
+----------------+---------------------------------------+---------------------------------------------+----------------------------------------------+
| BrowserCountry | COUNT(DISTINCT hits.parquet.HitColor) | COUNT(DISTINCT hits.parquet.BrowserCountry) | COUNT(DISTINCT hits.parquet.BrowserLanguage) |
+----------------+---------------------------------------+---------------------------------------------+----------------------------------------------+
| GU             | 3                                     | 1                                           | 2                                            |
| _P             | 3                                     | 1                                           | 3                                            |
| �              | 3                                     | 1                                           | 83                                           |
| 4v             | 3                                     | 1                                           | 2                                            |
| nH             | 3                                     | 1                                           | 3                                            |
| nS             | 3                                     | 1                                           | 4                                            |
| MI             | 3                                     | 1                                           | 6                                            |
| 9F             | 3                                     | 1                                           | 2                                            |
| Sc             | 3                                     | 1                                           | 3                                            |
| BO             | 3                                     | 1                                           | 3                                            |
+----------------+---------------------------------------+---------------------------------------------+----------------------------------------------+
10 rows in set. Query took 1.325 seconds.
```
##########
datafusion/physical-expr/src/aggregate/count_distinct.rs:
##########
@@ -438,6 +443,194 @@ where
}
}
+#[derive(Debug)]
+struct StringDistinctCountAccumulator(SSOStringHashSet);
+impl StringDistinctCountAccumulator {
+ fn new() -> Self {
+ Self(SSOStringHashSet::new())
+ }
+}
+
+impl Accumulator for StringDistinctCountAccumulator {
+ fn state(&self) -> Result<Vec<ScalarValue>> {
+ let arr = StringArray::from_iter_values(self.0.iter());
Review Comment:
I think we can significantly improve this function by building the StringArray directly from BufferBuilder:
https://github.com/apache/arrow-datafusion/pull/8827/files#diff-b8e579478a16c2c74fb4e58dac83a2d7a74a40105c5329ccc9bfc366b6f89327R210-R244
```rust
    ...
    // to emit, we need to create the offsets array and build the string
    // array from that
    let mut offsets = vec![0; self.num_groups + 1];
    // iterate over each entry in the map, filling in
    // `offset[0]` = 0
    // `offset[i]` = length of group `i+1`
    self.map.drain().for_each(|group_entry| {
        // since we created each group sequentially,
        // we know the offsets will be sequential as well
        offsets[group_entry.group_idx + 1] = group_entry.len as i32
    });
    // transform lengths to offsets. For example, lengths of `[1, 2, 3]`
    // become offsets of `[0, 1, 3, 6]`
    for i in 1..offsets.len() {
        offsets[i] += offsets[i - 1];
    }
    // compute null and null offsets
    let nulls = if let Some(null_group) = self.null_group.take() {
        // (the "null" buffer is actually a validity buffer,
        // where 0 is null and 1 is valid)
        let mut nulls = BooleanBufferBuilder::new(self.num_groups);
        nulls.append_n(self.num_groups, true);
        nulls.set_bit(null_group, false);
        Some(NullBuffer::new(nulls.finish()))
    } else {
        None
    };
    self.num_groups = 0;
    // TODO: update memory accounting
    let offsets = ScalarBuffer::from(offsets);
    let string_array = StringArray::new(OffsetBuffer::new(offsets), values, nulls);
    Ok(vec![Arc::new(string_array)])
}

fn clear_shrink(&mut self, batch: &RecordBatch) {
    // TODO review
    let count = batch.num_rows();
    self.map.clear();
    // hasher does not matter since the map is cleared
    self.map.shrink_to(count, |_| 0);
    self.map_size = self.map.capacity() * std::mem::size_of::<GroupEntry>();
    self.hashes_buffer.clear();
    self.hashes_buffer.shrink_to(count);
}
```
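The lengths-then-prefix-sum trick the snippet relies on can be demonstrated without any arrow dependency. The sketch below (helper names `build_flat_strings` and `get` are hypothetical, for illustration only) writes each string's length into `offsets[i + 1]`, prefix-sums the lengths into offsets, and indexes one flat byte buffer, which is the same layout Arrow's `StringArray` uses:

```rust
// Std-only sketch of the offsets layout: lengths [1, 2, 3] become
// offsets [0, 1, 3, 6], and string i is values[offsets[i]..offsets[i + 1]].
fn build_flat_strings(groups: &[&str]) -> (Vec<u8>, Vec<i32>) {
    let mut values = Vec::new();
    let mut offsets = vec![0i32; groups.len() + 1];
    for (i, g) in groups.iter().enumerate() {
        values.extend_from_slice(g.as_bytes());
        offsets[i + 1] = g.len() as i32; // write lengths first ...
    }
    for i in 1..offsets.len() {
        offsets[i] += offsets[i - 1]; // ... then prefix-sum into offsets
    }
    (values, offsets)
}

// Reconstruct string i from the flat buffer, as StringArray does internally.
fn get<'a>(values: &'a [u8], offsets: &[i32], i: usize) -> &'a str {
    std::str::from_utf8(&values[offsets[i] as usize..offsets[i + 1] as usize]).unwrap()
}

fn main() {
    let (values, offsets) = build_flat_strings(&["a", "bb", "ccc"]);
    assert_eq!(offsets, vec![0, 1, 3, 6]);
    assert_eq!(get(&values, &offsets, 1), "bb");
    println!("ok");
}
```

This avoids allocating one `String` per distinct value when emitting state, which is the point of the suggestion above.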
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]