alamb commented on code in PR #8849:
URL: https://github.com/apache/arrow-datafusion/pull/8849#discussion_r1454134397
##########
benchmarks/queries/clickbench/README.md:
##########
@@ -28,6 +28,23 @@ SELECT
FROM hits;
```
+### Q1
+Query to test distinct count for String. All three columns are small strings (length either 1 or 2).
Review Comment:
I think these queries also fall under the data exploration category:
```suggestion
Models initial data exploration, to understand some statistics of the data.
Query to test distinct count for String. All three columns are small strings (length either 1 or 2).
```
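As an aside on why small (1-2 byte) strings are an interesting case for distinct counting: the accumulator in this PR uses an `SSOStringHashSet`. The following is a minimal std-only sketch of the general short-string-optimization idea (the names `SsoString` and `distinct_count` are hypothetical, not DataFusion's actual implementation): strings that fit a fixed inline buffer avoid a heap allocation per distinct value.

```rust
// Hypothetical sketch of short-string optimization for distinct counting.
// Not DataFusion's SSOStringHashSet: just the core idea that values like
// "GU" or "K" can be stored inline, while long values spill to the heap.
use std::collections::HashSet;

#[derive(Hash, PartialEq, Eq)]
enum SsoString {
    // small strings (e.g. HitColor, BrowserCountry) stored without allocating
    Inline { len: u8, buf: [u8; 8] },
    // fallback for longer values such as URLs
    Heap(String),
}

impl SsoString {
    fn new(s: &str) -> Self {
        let bytes = s.as_bytes();
        if bytes.len() <= 8 {
            let mut buf = [0u8; 8];
            buf[..bytes.len()].copy_from_slice(bytes);
            SsoString::Inline { len: bytes.len() as u8, buf }
        } else {
            SsoString::Heap(s.to_string())
        }
    }
}

fn distinct_count<'a>(values: impl Iterator<Item = &'a str>) -> usize {
    let mut set = HashSet::new();
    for v in values {
        set.insert(SsoString::new(v));
    }
    set.len()
}

fn main() {
    let colors = ["K", "G", "K", "Y", "G"];
    println!("{}", distinct_count(colors.iter().copied())); // prints 3
}
```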
##########
benchmarks/queries/clickbench/README.md:
##########
@@ -28,6 +28,23 @@ SELECT
FROM hits;
```
+### Q1
+Query to test distinct count for String. All three columns are small strings (length either 1 or 2).
+
+```sql
+SELECT
+    COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
+FROM hits;
+```
+
+### Q2
+Query to test distinct count for String. "URL" has length greater than 8
Review Comment:
Does this query test something significantly different than Q0 and Q1?
What do you think about maybe adding a GROUP BY as well?
```
❯ SELECT "BrowserCountry", COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM 'hits.parquet' GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
+----------------+---------------------------------------+---------------------------------------------+----------------------------------------------+
| BrowserCountry | COUNT(DISTINCT hits.parquet.HitColor) | COUNT(DISTINCT hits.parquet.BrowserCountry) | COUNT(DISTINCT hits.parquet.BrowserLanguage) |
+----------------+---------------------------------------+---------------------------------------------+----------------------------------------------+
| GU             | 3                                     | 1                                           | 2                                            |
| _P             | 3                                     | 1                                           | 3                                            |
| �              | 3                                     | 1                                           | 83                                           |
| 4v             | 3                                     | 1                                           | 2                                            |
| nH             | 3                                     | 1                                           | 3                                            |
| nS             | 3                                     | 1                                           | 4                                            |
| MI             | 3                                     | 1                                           | 6                                            |
| 9F             | 3                                     | 1                                           | 2                                            |
| Sc             | 3                                     | 1                                           | 3                                            |
| BO             | 3                                     | 1                                           | 3                                            |
+----------------+---------------------------------------+---------------------------------------------+----------------------------------------------+
10 rows in set. Query took 1.325 seconds.
```
##########
datafusion/physical-expr/src/aggregate/count_distinct.rs:
##########
@@ -438,6 +443,194 @@ where
}
}
+#[derive(Debug)]
+struct StringDistinctCountAccumulator(SSOStringHashSet);
+impl StringDistinctCountAccumulator {
+ fn new() -> Self {
+ Self(SSOStringHashSet::new())
+ }
+}
+
+impl Accumulator for StringDistinctCountAccumulator {
+ fn state(&self) -> Result<Vec<ScalarValue>> {
+ let arr = StringArray::from_iter_values(self.0.iter());
Review Comment:
I think we can significantly improve this function by building the StringArray directly from BufferBuilder:
https://github.com/apache/arrow-datafusion/pull/8827/files#diff-b8e579478a16c2c74fb4e58dac83a2d7a74a40105c5329ccc9bfc366b6f89327R210-R244
```rust
    ...
    // to emit, we need to create the offsets array and build the string
    // array from that
    let mut offsets = vec![0; self.num_groups + 1];
    // iterate over each entry in the map, filling in
    // `offset[0]` = 0
    // `offset[i]` = length of group `i+1`
    self.map.drain().for_each(|group_entry| {
        // since we created each group sequentially,
        // we know the offsets will be sequential as well
        offsets[group_entry.group_idx + 1] = group_entry.len as i32
    });
    // transform lengths to offsets. For example, lengths of `[1, 2, 3]`
    // become offsets of `[0, 1, 3, 6]`
    for i in 1..offsets.len() {
        offsets[i] += offsets[i - 1];
    }
    // compute null and null offsets
    let nulls = if let Some(null_group) = self.null_group.take() {
        // (the "null" buffer is actually a validity buffer,
        // where 0 is null and 1 is valid)
        let mut nulls = BooleanBufferBuilder::new(self.num_groups);
        nulls.append_n(self.num_groups, true);
        nulls.set_bit(null_group, false);
        Some(NullBuffer::new(nulls.finish()))
    } else {
        None
    };
    self.num_groups = 0;
    // TODO: update memory accounting
    let offsets = ScalarBuffer::from(offsets);
    let string_array = StringArray::new(OffsetBuffer::new(offsets), values, nulls);
    Ok(vec![Arc::new(string_array)])
}

fn clear_shrink(&mut self, batch: &RecordBatch) {
    // TODO review
    let count = batch.num_rows();
    self.map.clear();
    // hasher does not matter since the map is cleared
    self.map.shrink_to(count, |_| 0);
    self.map_size = self.map.capacity() * std::mem::size_of::<GroupEntry>();
    self.hashes_buffer.clear();
    self.hashes_buffer.shrink_to(count);
}
```
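The lengths-then-prefix-sum trick the snippet relies on can be demonstrated without any arrow dependency. The sketch below (helper names `build_flat_strings` and `get` are hypothetical, for illustration only) writes each string's length into `offsets[i + 1]`, prefix-sums the lengths into offsets, and indexes one flat byte buffer, which is the same layout Arrow's `StringArray` uses:

```rust
// Std-only sketch of the offsets layout: lengths [1, 2, 3] become
// offsets [0, 1, 3, 6], and string i is values[offsets[i]..offsets[i + 1]].
fn build_flat_strings(groups: &[&str]) -> (Vec<u8>, Vec<i32>) {
    let mut values = Vec::new();
    let mut offsets = vec![0i32; groups.len() + 1];
    for (i, g) in groups.iter().enumerate() {
        values.extend_from_slice(g.as_bytes());
        offsets[i + 1] = g.len() as i32; // write lengths first ...
    }
    for i in 1..offsets.len() {
        offsets[i] += offsets[i - 1]; // ... then prefix-sum into offsets
    }
    (values, offsets)
}

// Reconstruct string i from the flat buffer, as StringArray does internally.
fn get<'a>(values: &'a [u8], offsets: &[i32], i: usize) -> &'a str {
    std::str::from_utf8(&values[offsets[i] as usize..offsets[i + 1] as usize]).unwrap()
}

fn main() {
    let (values, offsets) = build_flat_strings(&["a", "bb", "ccc"]);
    assert_eq!(offsets, vec![0, 1, 3, 6]);
    assert_eq!(get(&values, &offsets, 1), "bb");
    println!("ok");
}
```

This avoids allocating one `String` per distinct value when emitting state, which is the point of the suggestion above.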
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]