oliviermeslin commented on issue #37655:
URL: https://github.com/apache/arrow/issues/37655#issuecomment-1764857405
After some additional tests, I discovered that this bug depends on the
size of the right table but is insensitive to the size of the left table.
So the bug is that the 4 GB key size check is applied to the size of the
complete right table.
Here are some additional tests that show this asymmetrical behavior.
```
library(stringi)
library(arrow)
library(dplyr)
# Generate a large number of rows with ONE heavy key column
n_rows <- 2e7
length_id <- 20
ids <- stringi::stri_rand_strings(n_rows, length = length_id)
# Create a large Arrow Table with heavy payloads
data <- data.frame(
  id = ids
) |>
  as_arrow_table() |>
  mutate(
    # Create payload variables
    variable1 = id,
    variable2 = id,
    variable3 = id,
    variable4 = id,
    variable5 = id,
    variable6 = id,
    variable7 = id,
    variable8 = id,
    variable9 = id,
    variable10 = id
  ) |>
  compute()
vars <- names(data)[!names(data) %in% c("id")]
nb_var <- length(vars)
# Join a heavy left dataset with itself, with an increasing number of
# variables on the right
# This fails when the right table has 9 variables
lapply(7:nb_var, function(n) {
  print(paste0("Doing the join with ", n+1, " variables"))
  vars_temp <- c("id", vars[1:n])
  print(vars_temp)
  data_out <- data |>
    left_join(
      data |>
        select(all_of(vars_temp)),
      by = c("id" = "id")
    ) |>
    compute()
  return("Success!")
})
# Join a lighter left dataset with itself, with an increasing number of
# variables on the right
# This fails again when the right table has 9 variables
lapply(7:nb_var, function(n) {
  print(paste0("Doing the join with ", n+1, " variables"))
  vars_temp <- c("id", vars[1:n])
  print(vars_temp)
  data_out <- data |> select(id) |>
    left_join(
      data |>
        select(all_of(vars_temp)),
      by = c("id" = "id")
    ) |>
    compute()
  return("Success!")
})
# Join the dataset with itself, with an increasing number of variables on
# the left and the full dataset on the right
# This fails with only two variables on the left
lapply(1:nb_var, function(n) {
  print(paste0("Doing the join with ", n+1, " variables"))
  vars_temp <- c("id", vars[1:n])
  print(vars_temp)
  data_out <- data |>
    select(all_of(vars_temp)) |>
    left_join(
      data,
      by = c("id" = "id")
    ) |>
    compute()
  return("Success!")
})
```
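As a rough sanity check on where the failures start, here is a
back-of-the-envelope estimate of the right table's size for the parameters
used above. It is only a sketch under stated assumptions: each string column
is assumed to cost about 20 bytes of character data plus a 4-byte offset per
row, the threshold is assumed to be 2^32 bytes, and validity bitmaps and any
overhead added inside the join are ignored. Whether Arrow counts bytes this
way internally is not something I have verified.

```r
# Rough size estimate for a table of fixed-length ASCII string columns.
# Assumption: per row, a string column costs its character bytes plus a
# 4-byte offset; bitmaps and join-internal overhead are ignored.
n_rows <- 2e7
length_id <- 20
bytes_per_column <- n_rows * (length_id + 4)

# Assumption: the "4 GB" threshold is 2^32 bytes
limit_4gb <- 2^32

# Total number of columns in the right table (id + payload variables)
estimate <- data.frame(n_columns = 8:11)
estimate$approx_bytes <- estimate$n_columns * bytes_per_column
estimate$over_4gb <- estimate$approx_bytes > limit_4gb
print(estimate)
```

Under these assumptions the estimate crosses the threshold between 8 and 9
columns, which matches the point where the first two tests start failing,
and the full 11-column table used on the right in the third test is over
the threshold from the first iteration.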