edponce commented on a change in pull request #11257:
URL: https://github.com/apache/arrow/pull/11257#discussion_r718152329
##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -121,6 +122,84 @@ Result<std::unique_ptr<KernelState>>
CountInit(KernelContext*,
static_cast<const CountOptions&>(*args.options));
}
+// ----------------------------------------------------------------------
+// Distinct Count implementation
+
+template <typename Type>
+struct CountDistinctImpl : public ScalarAggregator {
+ using MemoTable = typename arrow::internal::HashTraits<Type>::MemoTableType;
+
+ explicit CountDistinctImpl(MemoryPool* memory_pool, CountOptions options)
+ : options(std::move(options)), memo_table_(new MemoTable(memory_pool,
0)) {}
+
+ Status Consume(KernelContext*, const ExecBatch& batch) override {
+ if (batch[0].is_array()) {
+ const ArrayData& arr = *batch[0].array();
+ auto visit_null = [&]() {
+ if (this->nulls > 0) return Status::OK();
+ ++this->nulls;
+ return Status::OK();
+ };
Review comment:
IIUC, `this->nulls` is of interest in terms of it either being zero or
non-zero, and if it is non-zero then it is treated as a distinct value of 1.
The lambda has the check to try to short-circuit based on it being non-zero but
I will argue that the conditional check can be more expensive than a simple
integer add. So I would simplify it to an unconditional counter and use its
value once via `std::min(this->nulls, 1)`.
```c++
auto visit_null = [&]() {
++this->nulls;
return Status::OK();
}
```
Even better, I suggest to use `GetNullCount()` and make a dummy lambda:
```c++
this->nulls = arr.GetNullCount();
auto visit_null = []() { return Status::OK(); }
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]