[jira] [Commented] (ARROW-16897) [R][C++] Full join on Arrow objects is incorrect

Weston Pace (Jira) Wed, 29 Jun 2022 21:00:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560809#comment-17560809
 ]


Weston Pace commented on ARROW-16897:
-------------------------------------

I believe I have found the root cause.  I'll give some more detail.  In Acero a 
join node has a left output and a right output.  Key columns can be output on 
both the left and the right side.  In an inner join this just leads to 
duplicate data.  In an outer join it's a bit more subtle.

For example, consider the inner join of the above (years 2000 - 2005 but only 
2005 has a match in both):
||left.group||left.year||right.group||right.year||value||
|g1|2005|g1|2005|-1.008|

In an inner join you only get rows that have matches in both tables and so the 
two key columns will always be identical.  However, in an outer join, things 
are different.  Here is the same outer join for the above five years:
||left.group||left.year||right.group||right.year||value||
|NA|NA|g1|2000|NA|
|NA|NA|g1|2001|NA|
|NA|NA|g1|2002|NA|
|NA|NA|g1|2003|NA|
|NA|NA|g1|2004|NA|
|g1|2005|g1|2005|-1.008|

Note that it would be possible to get values in the left keys with NA in the 
right keys if the right table was missing data (in this particular example the 
right table has the full set of keys and so it is never missing rows).

Acero is actually giving you more information that you would normally get out 
of a standard SQL server.  In a standard SQL server you would never know 
whether a row simple did not have a match or the row matched but all the 
payload columns were null.  And, if a row didn't have a match and all the 
payload columns were null you wouldn't know which side had they key columns.  
Still, this information isn't generally useful and one could portray this as 
calling a bug a feature :)

R currently deals with this duplication of key columns by always excluding the 
key columns from the right output.  So, given the above, R throws away 
{{right.group}} and {{left.group}} and you end up with:

 
||left.group||left.year||value||
|NA|NA|NA|
|NA|NA|NA|
|NA|NA|NA|
|NA|NA|NA|
|NA|NA|NA|
|g1|2005|-1.008|

I'm not sure if Acero's join node is dropping the rows with all nulls or if 
that is happening later on in some further node but these aren't very useful 
rows anyways.  If I reverse R's logic so that left_output is "let fields minus 
keys" and right_output is "right fields" then I get the correct result.

However, the correct answer is not to always drop the keys from the right side 
or to always drop the keys from the left side but instead to always coalesce 
the key columns.

 

Fortunately, we have a PR that adds this as an option (append_consolidated_key) 
to the hash join node options and will take care of the coalescing for you: 
[https://github.com/michalursa/arrow/pull/7/files#diff-985b684376a17de896e07a2ff2362934dbf8fa2483edeaeff6910907f6f96733]

 

Unfortunately, this PR was built on top of the improved hash-join node which 
hasn't yet merged.

> [R][C++] Full join on Arrow objects is incorrect
> ------------------------------------------------
>
>                 Key: ARROW-16897
>                 URL: https://issues.apache.org/jira/browse/ARROW-16897
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 8.0.0
>         Environment: Linux
>            Reporter: Oliver Reiter
>            Assignee: Weston Pace
>            Priority: Critical
>              Labels: joins, query-engine
>             Fix For: 9.0.0
>
>
> Hello,
> I am trying to do a full join on a dataset. It produces the correct number of 
> observations, but not the correct result (the resulting data.frame is just 
> filled up with NA-rows).
> My use case: I want to include the 'full' year range for every factor value:
> {code:java}
> library(data.table)
> library(arrow)
> library(dplyr)
> year_range <- 2000:2019
> group_n <- 100
> N <- 1000 ## the resulting data should have 100 groups * 20 years
> dt <- data.table(value = rnorm(N),
>                  group = rep(paste0("g", 1:group_n), length.out = N))
> ## there are only observations for some years in every group
> dt[, year := sample(year_range, size = N / group_n), by = .(group)]
> dt[group == "g1", ]
> ## this would be the 'full' data.table
> group_years <- data.table(group = rep(unique(dt$group), each = 20),
>                           year = rep(year_range, times = 10))
> group_years[group == "g1", ]
> write_dataset(dt, path = "parquet_db")
> db <- open_dataset(sources = "parquet_db")
> ## full_join using data.table -> expected result
> db_full <- merge(dt, group_years,
>                  by = c("group", "year"),
>                  all = TRUE)
> setorder(db_full, group, year)
> db_full[group == "g1", ]
> ## try to do the full_join with arrow -> incorrect result
> db_full_arrow <- db |>
>   full_join(group_years, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]
> ## or: convert data.table to arrow_table beforehand -> incorrect result
> group_years_arrow <- group_years |>
>   as_arrow_table()
> db_full_arrow <- db |>
>   full_join(group_years_arrow, by = c("group", "year")) |>
>   collect() |>
>   setDT()
> setorder(db_full_arrow, group, year)
> db_full_arrow[group == "g1", ]{code}
> The [documentation|https://arrow.apache.org/docs/r/] says equality joins are 
> supported, which should hold also for `full_join` I guess?
> Thanks for your time and work!
>  
> Oliver



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-16897) [R][C++] Full join on Arrow objects is incorrect

Reply via email to