jhorstmann commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888607125
More concretely regarding the proposal, it is still a bit unclear to me how exactly the signature would work. If it has a fixed length and is computed from the values, then collisions are possible. If it is supposed to be an integer sequence, then we would need another hashmap to create it.

To fix the immediate problem of null values, I would try to encode them inline into the `Vec<u8>`, for example by prepending a 0 or 1 byte before the bytes of each group by column. The bytes for non-valid entries then also need to be set to a default value for that type. Maybe there is a smart way to omit the bytes for these non-valid entries, but care is needed to ensure that no two different keys get the same encoded bytes.

I think in the current group by implementation, this vector is fully responsible for equality and hashing. That means the `Eq` and `Hash` implementations of `GroupByScalar` are not really used, and we could replace it with `ScalarValue`. The `ScalarValue` is already created only when a new entry needs to be inserted into the hashmap.
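To illustrate the inline null encoding, here is a minimal standalone sketch. It is not DataFusion code: the `Value` enum is a hypothetical stand-in for `ScalarValue`, and `encode_value`, `encode_key` and `count_groups` are names made up for this example. The length prefix for strings is an extra assumption, added so that variable-length values cannot run into the bytes of the next column.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a nullable group-by value (not DataFusion's ScalarValue).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Append one column value to the key buffer: a validity byte, then the payload.
fn encode_value(buf: &mut Vec<u8>, value: &Value) {
    match value {
        Value::Int64(Some(v)) => {
            buf.push(1); // valid
            buf.extend_from_slice(&v.to_le_bytes());
        }
        Value::Int64(None) => {
            buf.push(0); // null
            // Default bytes keep the encoding fixed-width for this type.
            buf.extend_from_slice(&0i64.to_le_bytes());
        }
        Value::Utf8(Some(s)) => {
            buf.push(1); // valid
            // Assumption: a length prefix delimits variable-length data.
            buf.extend_from_slice(&(s.len() as u64).to_le_bytes());
            buf.extend_from_slice(s.as_bytes());
        }
        Value::Utf8(None) => {
            buf.push(0); // null
            // A null string is encoded as length 0 with no payload bytes.
            buf.extend_from_slice(&0u64.to_le_bytes());
        }
    }
}

/// Encode all group-by columns of one row into a single byte key.
fn encode_key(row: &[Value]) -> Vec<u8> {
    let mut buf = Vec::new();
    for value in row {
        encode_value(&mut buf, value);
    }
    buf
}

/// Count rows per group. The byte key alone drives hashing and equality;
/// the typed values are cloned only when a new group is inserted.
fn count_groups(rows: &[Vec<Value>]) -> HashMap<Vec<u8>, (Vec<Value>, u64)> {
    let mut groups: HashMap<Vec<u8>, (Vec<Value>, u64)> = HashMap::new();
    for row in rows {
        let key = encode_key(row);
        groups.entry(key).or_insert_with(|| (row.clone(), 0)).1 += 1;
    }
    groups
}
```

With this layout a null `Int64` and a valid `0` differ in the validity byte, and a null string differs from an empty string in the same way, so distinct keys do not end up with the same bytes. Omitting the default bytes for null entries would shorten the key, but whether that stays collision-free depends on the rest of the encoding, so the sketch keeps them.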
