jhorstmann commented on issue #790:
URL: 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888607125


   Regarding the proposal itself, it is still a bit unclear to me how exactly 
the signature would work. If it has a fixed length and is computed from the 
key values, then collisions are possible. If it is supposed to be an integer 
sequence number, then we would need another hashmap to assign it.
   
   To fix the immediate problem of null values, I would try to encode them 
inline into the `Vec<u8>`, for example by prepending a validity byte (0 or 1) 
to the bytes of each group-by column. The bytes for null entries would then 
also need to be set to a default value for that type, so that all nulls of a 
column encode identically. Maybe there is a smart way to omit the bytes for 
null entries entirely, but care is needed to ensure that no two different 
keys get the same encoded bytes.
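   
   A minimal sketch of the validity-byte idea, assuming every group-by column 
is a nullable `i64`. The helper names `encode_nullable_i64` and `encode_key` 
are made up for illustration and are not part of DataFusion:

```rust
/// Hypothetical helper (not a DataFusion API): appends one nullable i64
/// group-by value to the key buffer as a validity byte plus 8 payload bytes.
fn encode_nullable_i64(key: &mut Vec<u8>, value: Option<i64>) {
    match value {
        Some(v) => {
            key.push(1); // validity byte: value is present
            key.extend_from_slice(&v.to_le_bytes());
        }
        None => {
            key.push(0); // validity byte: value is null
            // Default payload, so every null encodes to the same bytes.
            key.extend_from_slice(&0i64.to_le_bytes());
        }
    }
}

/// Hypothetical helper: encodes one row of group-by values into a single key.
/// Each column contributes a fixed 9 bytes, so column boundaries are unambiguous.
fn encode_key(row: &[Option<i64>]) -> Vec<u8> {
    let mut key = Vec::with_capacity(row.len() * 9);
    for value in row {
        encode_nullable_i64(&mut key, *value);
    }
    key
}
```

   With this layout, a null and a zero differ in their first byte, so 
`encode_key(&[None])` and `encode_key(&[Some(0)])` cannot collide. Omitting 
the payload for nulls would shorten the key, but then variable-length types 
would need extra care to keep column boundaries unambiguous.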
   
   I think in the current group by implementation, this vector is fully 
responsible for equality and hashing. That means the `Eq` and `Hash` 
implementations of `GroupByScalar` are not really used, and we could replace 
`GroupByScalar` with `ScalarValue`. The `ScalarValue` is already created only 
when a new entry needs to be inserted into the hashmap.
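   
   To illustrate that point, here is a rough sketch, not the actual DataFusion 
code. `Scalar`, `GroupState` and `accumulate` are hypothetical stand-ins, and 
the aggregate is a simple row count. The map is keyed by the encoded bytes, so 
the stored scalar type needs neither `Eq` nor `Hash`:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `ScalarValue`; note that it derives neither
/// `Eq` nor `Hash`, because the map below never hashes or compares it.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
}

/// Hypothetical per-group state: the group values plus a running row count.
struct GroupState {
    group_values: Vec<Scalar>,
    count: u64,
}

/// Looks up the group by its encoded key. The scalar values are only
/// materialized inside `or_insert_with`, i.e. when a new group is created.
fn accumulate(
    groups: &mut HashMap<Vec<u8>, GroupState>,
    key: Vec<u8>,
    row: &[Option<i64>],
) {
    let state = groups.entry(key).or_insert_with(|| GroupState {
        group_values: row.iter().map(|v| Scalar::Int64(*v)).collect(),
        count: 0,
    });
    state.count += 1;
}
```

   Rows that encode to the same key bytes land in the same `GroupState`, 
regardless of how the scalar type itself would compare or hash.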

