zanmato1984 commented on issue #43768:
URL: https://github.com/apache/arrow/issues/43768#issuecomment-2301194348
I guess your major concern is that counting a valid scalar only once (i.e.,
`this->count += scalar.is_valid`), as opposed to the actual batch length times
(i.e., `this->count += scalar.is_valid * batch.length` as you proposed), would
make the count less and emit wrong result given how this count is used:
```
class ARROW_EXPORT ScalarAggregateOptions : public FunctionOptions {
...
/// If less than this many non-null values are observed, emit null.
uint32_t min_count;
};
```
```
if ((!options.skip_nulls && !this->any && this->has_nulls) ||
this->count < options.min_count) {
out->value = std::make_shared<BooleanScalar>();
}
```
For example, assume a batch contains 3 rows and the `min_count` is 2.
According to the current logic, `any(false)` would have count 1 (as opposed to
the batch length 3) and emit `null` (`count == 1 < min_count == 2`). I think
this is wrong.
@felipecrv What do you think? Thanks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]