alamb opened a new pull request #793: URL: https://github.com/apache/arrow-datafusion/pull/793
# Which issue does this PR close? This PR closes https://github.com/apache/arrow-datafusion/issues/781 and https://github.com/apache/arrow-datafusion/issues/782; Built on https://github.com/apache/arrow-datafusion/pull/786 so review that first. # Rationale for this change 1. Grouping on columns that contain NULLs today produces incorrect results 2. This is what I think is the minimum change required to produce correct results This is a version of the "Alternative" approach described in https://github.com/apache/arrow-datafusion/issues/790 which I think is the minimum change to GroupByHash to produce the correct answers when grouping on columns that contain nulls It will likely reduce the speed of grouping as well as require more memory than the current implementation (though it does get correct answers!) I created this PR it available for comparison and as a fallback in case I run into trouble or run out of time time trying to implement https://github.com/apache/arrow-datafusion/issues/790, which I expect will take longer to code and review. # What changes are included in this PR? 1. https://github.com/apache/arrow-datafusion/pull/786 2. Include a "null byte" for each column when creating the key to the hash table 3. Fix some bugs related to NULL handling in `ScalarValue` 4. Tests On master keys are created like this: ``` string len 0x1234 { (as usize le) "foo" (as u16 le) k1: "foo" ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ k2: 0x1234u16 │03│00│00│00│00│00│00│00│"f│"o│"o│34│12│ } └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ 0 1 2 3 4 5 6 7 8 9 10 11 12 ``` After this PR, the keys are created as follows (note the two extra bytes, one for each grouping column) Example of a key without any nulls: ``` 0xFF byte at the start of each column signifies the value is non-null │ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐ │ string len │ 0x1234 { ▼ (as usize le) "foo" ▼(as u16 le) k1: "foo" ╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐ k2: 0x1234u16 FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│ } ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ``` Example of a key with NULL values: ``` 0xFE byte at the start of k1 column ┌ ─ signifies the value is NULL └ ┐ 0x1234 { ▼ (as u16 le) k1: NULL ╔ ═╔ ═┌──┬──┐ k2: 0x1234u16 FE║FF║12│34│ } ╚ ═╚ ═└──┴──┘ 0 1 2 3 ``` # Are there any user-facing changes? Correct answers! # Benchmark results ... Coming soon ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org