[GitHub] [arrow-datafusion] alamb opened a new pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

GitBox Tue, 27 Jul 2021 11:09:22 -0700


alamb opened a new pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786



   Draft until I complete performance comparisons
   
   # Which issue does this PR close?
   
   Fixes https://github.com/apache/arrow-datafusion/issues/376
   
   
    # Rationale for this change
   This proposal is both both a code cleanup and a step towards computing the 
correct answers when group by has nulls 
(https://github.com/apache/arrow-datafusion/issues/781 and 
https://github.com/apache/arrow-datafusion/issues/782).
   
   Since `GroupByScalar` does not have any notion of a NULL or None now, but 
`ScalarValue` does, rather than trying to teach GroupByScalar about Nulls it 
seemed like a good opportunity to remove it entirely.
   
   In parallel I am working on a proposal for the rest of #781
   
   
   # What changes are included in this PR?
   1. Remove GroupByScalar and related conversion code
   2. Add `Eq` and `Hash` implementations for `ScalarValue`, based on 
`OrderedFloat` (which was used by `GroupByScalar`)
   3. Add `impl From<Option<..>>` for `ScalarValue` to make it easier to use
   
   # Are there any user-facing changes?
   Not functionally (though more memory will be used when grouping)
   
   # Benchmark results
   (in progress)
   
   # Concerns
   This change increases the size of each potential group key by a factor of 4 
in HashAggregates to `64`, the `size_of(ScalarValue)` up from from `16` (the 
size of `GroupScalar`).
   
   This optimization was added by @Dandandan in 
https://github.com/apache/arrow-datafusion/commit/1ecdf3f7cf6ff5c03e14acc73271d356f191eb33
 / https://github.com/apache/arrow/pull/8765 and I would be interested in 
hearing his thoughts on the matter here. In any event we will likely need more 
than the existing 16 bytes per group key to handle null values, but we probably 
don't need an extra 56 bytes per key.
   
   Some options I can think of
   1. Take the increased memory hit for now (this PR) and re-optimize later
   2. Take the approach in https://github.com/apache/arrow/pull/8765 and 
`Box`/`Arc` the non primitive values in ScalarValue (which might have some 
small overhead)
   
   Perhaps @jorgecarleitao  has some additional thoughts
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb opened a new pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Reply via email to