[GitHub] [arrow] aucahuasi commented on a change in pull request #11257: ARROW-14035: [C++][Python] Implement count distinct kernel

GitBox Wed, 29 Sep 2021 09:17:39 -0700


aucahuasi commented on a change in pull request #11257:
URL: https://github.com/apache/arrow/pull/11257#discussion_r718682358




##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic.cc
##########
@@ -754,6 +839,30 @@ void RegisterScalarAggregateBasic(FunctionRegistry* 
registry) {
                aggregate::CountInit, func.get());
   DCHECK_OK(registry->AddFunction(std::move(func)));
 
+  func = std::make_shared<ScalarAggregateFunction>(
+      "count_distinct", Arity::Unary(), &count_distinct_doc, 
&default_count_options);
+
+  // Takes any input, outputs int64 scalar
+  aggregate::AddCountDistinctKernel<Int8Type>(int8(), func.get());
+  aggregate::AddCountDistinctKernel<Int16Type>(int16(), func.get());
+  aggregate::AddCountDistinctKernel<Int32Type>(int32(), func.get());
+  aggregate::AddCountDistinctKernel<Date32Type>(date32(), func.get());
+  aggregate::AddCountDistinctKernel<Int64Type>(int64(), func.get());

Review comment:
       > We'd still register the same types, however, we'd be instructing the 
compiler to instantiate the kernel less (since e.g. TimestampType and Int64Type 
would both delegate to CountDistinctImpl)
   
   I see thanks.
   
   > make sure things like zoned timestamps are properly supported
   
   How does the timezone gets encoded into the timestamp? is it part of its 
physical data layout? I think is a good idea to add a test for this case and 
for intervals as well ;)
   
   > You could also do something like 
AddCountDistinctKernel<Int64Type>(uint64()) which would collapse the signed and 
unsigned implementations for each bit width.
   
   Does this will use implicit casting for kernel resolution?
   
   btw I can use an approach like SumLike no problem (is not that complicated), 
my point is that we will have more lines of code/abstractions at the end and we 
will make a potential refactor later to support more types.
   
   > at the very least, it may be worth putting up a follow-up JIRA. (Also 
because I think there's other things we may want to do, e.g. we could perhaps 
support dictionaries.)
   
   Yes that is part of my point too. I like to have consistency as well, but I 
think most of this refactor should take place once we identify we need to add 
other types and not before, is better to have a generic code but not to create 
internal API when we don't know yet the requirements for each case: i.e. when 
in doubt leave it out ;)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] aucahuasi commented on a change in pull request #11257: ARROW-14035: [C++][Python] Implement count distinct kernel

Reply via email to