alamb opened a new pull request #786: URL: https://github.com/apache/arrow-datafusion/pull/786
Draft until I complete performance comparisons # Which issue does this PR close? Fixes https://github.com/apache/arrow-datafusion/issues/376 # Rationale for this change This proposal is both both a code cleanup and a step towards computing the correct answers when group by has nulls (https://github.com/apache/arrow-datafusion/issues/781 and https://github.com/apache/arrow-datafusion/issues/782). Since `GroupByScalar` does not have any notion of a NULL or None now, but `ScalarValue` does, rather than trying to teach GroupByScalar about Nulls it seemed like a good opportunity to remove it entirely. In parallel I am working on a proposal for the rest of #781 # What changes are included in this PR? 1. Remove GroupByScalar and related conversion code 2. Add `Eq` and `Hash` implementations for `ScalarValue`, based on `OrderedFloat` (which was used by `GroupByScalar`) 3. Add `impl From<Option<..>>` for `ScalarValue` to make it easier to use # Are there any user-facing changes? Not functionally (though more memory will be used when grouping) # Benchmark results (in progress) # Concerns This change increases the size of each potential group key by a factor of 4 in HashAggregates to `64`, the `size_of(ScalarValue)` up from from `16` (the size of `GroupScalar`). This optimization was added by @Dandandan in https://github.com/apache/arrow-datafusion/commit/1ecdf3f7cf6ff5c03e14acc73271d356f191eb33 / https://github.com/apache/arrow/pull/8765 and I would be interested in hearing his thoughts on the matter here. In any event we will likely need more than the existing 16 bytes per group key to handle null values, but we probably don't need an extra 56 bytes per key. Some options I can think of 1. Take the increased memory hit for now (this PR) and re-optimize later 2. Take the approach in https://github.com/apache/arrow/pull/8765 and `Box`/`Arc` the non primitive values in ScalarValue (which might have some small overhead) Perhaps @jorgecarleitao has some additional thoughts -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
