alamb opened a new pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793


   # Which issue does this PR close?
   This PR closes https://github.com/apache/arrow-datafusion/issues/781 and 
https://github.com/apache/arrow-datafusion/issues/782;
   
   Built on https://github.com/apache/arrow-datafusion/pull/786 so review that 
first.
   
    # Rationale for this change
   1. Grouping on columns that contain NULLs today produces incorrect results
   2. This is what I think is the minimum change required to produce correct 
results
   
   This is a version of the "Alternative" approach described in 
https://github.com/apache/arrow-datafusion/issues/790 which I think is the 
minimum change to GroupByHash to produce the correct answers when grouping on 
columns that contain nulls
   
   It will likely reduce the speed of grouping as well as require more memory 
than the current implementation (though it does get correct answers!)
   
   I created this PR  it available for comparison and as a fallback in case I 
run into trouble or run out of time time trying to implement 
https://github.com/apache/arrow-datafusion/issues/790, which I expect will take 
longer to code and review.
   
   # What changes are included in this PR?
   1. https://github.com/apache/arrow-datafusion/pull/786
   2. Include a "null byte" for each column when creating the key to the hash 
table
   3. Fix some bugs related to NULL handling in `ScalarValue`
   4. Tests
   
   On master keys are created like this:
   ```
                               string len                   0x1234
   {                          (as usize le)      "foo"    (as u16 le)
     k1: "foo"         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
     k2: 0x1234u16     │03│00│00│00│00│00│00│00│"f│"o│"o│34│12│
   }                   └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                       0  1  2  3  4  5  6  7  8  9  10 11 12
   ```
   
   After this PR, the keys are created as follows (note the two extra bytes, 
one for each grouping column)
   
   Example of a key without any nulls:
   
   ```
                          0xFF byte at the start of each column
                             signifies the value is non-null
                                            │
   
                        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐
   
                        │        string len                 │  0x1234
   {                    ▼       (as usize le)      "foo"    ▼(as u16 le)
     k1: "foo"        ╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐
     k2: 0x1234u16     FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│
   }                  ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘
                       0  1  2  3  4  5  6  7  8  9  10 11 12 13 14
   ```
   
   Example of a key with NULL values:
   ```
                           0xFE byte at the start of k1 column
                       ┌ ─     signifies the value is NULL
   
                       └ ┐
                                0x1234
   {                     ▼    (as u16 le)
     k1: NULL          ╔ ═╔ ═┌──┬──┐
     k2: 0x1234u16      FE║FF║12│34│
   }                   ╚ ═╚ ═└──┴──┘
                         0  1  2  3
   ```
   
   
   # Are there any user-facing changes?
   Correct answers! 
   
   # Benchmark results
   
   ... Coming soon ...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to