nealrichardson opened a new issue, #41358:
URL: https://github.com/apache/arrow/issues/41358

   ### Describe the enhancement requested
   
   It would be nice to be able to treat missing values in join keys as 
meaningful data you can match to. To illustrate using R, suppose I take a 
simple table and compute a hash aggregation on it:
   
   ``` r
   library(dplyr)
   
   df <- data.frame(
     a = 1:6,
     b = c("a", "a", NA, NA, "b", "b")
   )
   
   means <- df |>
     group_by(b) |>
     summarize(mean_a = mean(a))
   means
   
   #> # A tibble: 3 × 2
   #>   b     mean_a
   #>   <chr>  <dbl>
   #> 1 a        1.5
   #> 2 b        5.5
   #> 3 <NA>     3.5
   ```
   
   Now I want to left-join the result back to the table:
   
   ```r
   left_join(df, means, by = "b")
   #>   a    b mean_a
   #> 1 1    a    1.5
   #> 2 2    a    1.5
   #> 3 3 <NA>    3.5
   #> 4 4 <NA>    3.5
   #> 5 5    b    5.5
   #> 6 6    b    5.5
   ```
   
   `dplyr` treats `NA` as meaningful data and pulls in the corresponding 3.5 
value from the `mean_a` column.
   
   If I do the same thing in arrow though, I get missing values for `mean_a` 
where the join key `b` is missing:
   
   ``` r
   library(arrow)
   
   tab <- arrow_table(df)
   
   mean_tab <- tab |>
     group_by(b) |>
     summarize(mean_a = mean(a))
   
   left_join(tab, mean_tab, by = "b") |>
     collect()
   #> # A tibble: 6 × 3
   #>       a b     mean_a
   #>   <int> <chr>  <dbl>
   #> 1     1 a        1.5
   #> 2     2 a        1.5
   #> 3     5 b        5.5
   #> 4     6 b        5.5
   #> 5     3 <NA>    NA  
   #> 6     4 <NA>    NA
   ```
   
   dplyr's join functions have a "na_matches" argument that governs this 
behavior:
   
   > **na_matches** 
   >     Should two NA or two NaN values match?
   > 
   >     * "na", the default, treats two NA or two NaN values as equal, like 
%in%, base::match(), and base::merge().
   > 
   >     * "never" treats two NA or two NaN values as different, and will never 
match them together or to any other values. This is similar to joins for 
database sources and to base::merge(incomparables = NA).
   
   If you change the default to "never", dplyr matches the behavior acero has 
today:
   
   ```r
   left_join(df, means, by = "b", na_matches = "never")
   #>   a    b mean_a
   #> 1 1    a    1.5
   #> 2 2    a    1.5
   #> 3 3 <NA>     NA
   #> 4 4 <NA>     NA
   #> 5 5    b    5.5
   #> 6 6    b    5.5
   ```
   
   I don't see that acero has this option today. It would be great if it did 
have that option.
   
   ### Component(s)
   
   C++, R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to