drusso commented on pull request #8222:
URL: https://github.com/apache/arrow/pull/8222#issuecomment-703289564


   I've rebased against the latest changes on the master branch. Some notes and 
questions about the changes:
   
   * Most of the changes are scoped to DataFusion, but an update to the 
`concat` kernel was required. Can the Arrow changes be bundled with the 
DataFusion changes, or should this be a separate issue/pull request?
   * I kept the changes to the `concat()` kernel as small as possible, scoped 
strictly to support the count distinct feature in this pull request. However, 
there is room for improvement here, because we can support lists of any data 
type.
   * Is there an implementation to go from `Vec<Vec<_>>` to `ListArray`? I 
wasn't aware of one, but something along those lines I think would improve the 
readability of the `test_concat_primitive_list_arrays` test added in the 
`concat` kernel.
   * Made the switch from `LargeListArray` to `ListArray`. This was motivated 
by the `take()` kernel already supporting `ListArray`, but not 
`LargeListArray`. Supporting `LargeListArray` would require handling of `i64` 
offsets – not for addressing the array's lists (`i32` is fine for that), but 
for addressing the array's lists' values (there is currently a recursive call 
to take values 
[here](https://github.com/apache/arrow/blob/57f548c743acfb4d8311fae034abf790ecb2d374/rust/arrow/src/compute/kernels/take.rs#L328)).
 Nothing that can't be solved, but I thought it best to leave that out of scope 
for this particular pull request.
   * Following the #8172 changes, and since aggretations now have their own 
state, there was no longer a need for a new `AggregateMode` variant. This 
change was removed.
   * As discussed, I've replaced `DistinctScalarValue` with `GroupByScalar`. To 
do so, I promoted `GroupByScalar` to its own module.
   * As described in the issue, the goal is to implement `COUNT(DISTINCT col)` 
for a _single_ argument. Much of this implementation should work as-is for 
multiple arguments, but I haven't yet tested or tried that.
   * The state of `DistinctCountAccumulator` uses one field per input input 
argument. Each field is emitted as a `ListArray`, i.e.,  a vector of _N_ arrays 
of lists of primitives.  An alternate implementation, which may be worth 
exploring, would be to use a _single_ state field, and it emits a vector on _1_ 
array of lists of structs of primitives.
   
   Let me know what you think and if any changes are needed. Thanks!
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to