Re: [I] [C++] Acero cannot join large tables because of a misspecified test [arrow]

via GitHub Mon, 16 Oct 2023 09:32:42 -0700


oliviermeslin commented on issue #37655:
URL: https://github.com/apache/arrow/issues/37655#issuecomment-1764857405


   After some additional tests, I discovered that this bug depends on the 
size of the right table but is insensitive to the size of the left table. 
So the bug is that the 4 GB key size test is applied to the size of the 
complete right table, payload columns included.
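
   As a rough sanity check, the size of the right table can be estimated by hand for the test data used in this comment. This is only a back-of-the-envelope sketch: it assumes 1 byte per character (the random strings are ASCII), a 4-byte offset per string value, and a limit of 2^32 bytes. How Acero actually accounts for the size is not confirmed here, and `bytes_per_column`, `limit_bytes` and `estimate` are names introduced for illustration.

   ```r
   # Same dimensions as the test data in this comment
   n_rows <- 2e7
   length_id <- 20

   # ASSUMPTION: 1 byte per character plus a 4-byte offset per value
   bytes_per_column <- n_rows * (length_id + 4)

   # ASSUMPTION: the "4 GB" limit is 2^32 bytes
   limit_bytes <- 2^32

   # Estimated size of a table with 1 to 11 identical string columns
   n_cols <- 1:11
   estimate <- data.frame(
     n_cols = n_cols,
     approx_gb = n_cols * bytes_per_column / 1e9,
     exceeds_limit = n_cols * bytes_per_column > limit_bytes
   )
   print(estimate)
   ```

   Under these assumptions the estimate first exceeds the limit at 9 columns, which matches the point where the joins fail in the tests, although the agreement may be partly coincidental.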
   
   Here are some additional tests to show this asymmetrical behavior.
   
   ```r
   library(stringi)
   library(arrow)
   library(dplyr)
   
   
   # Generate a large number of rows with ONE heavy key column
   n_rows <- 2e7
   length_id <- 20
   ids <- stringi::stri_rand_strings(n_rows, length = length_id)
   
   # Create a large Arrow Table with heavy payloads
   data <- data.frame(
     id = ids
   ) |> 
     as_arrow_table() |>
     mutate(
       # Create payload variables
       variable1   = id,
       variable2   = id,
       variable3   = id,
       variable4   = id,
       variable5   = id,
       variable6   = id,
       variable7   = id,
       variable8   = id,
       variable9   = id,
       variable10  = id
     ) |>
     compute()
   
   
   vars <- names(data)[!names(data) %in% c("id")]
   nb_var <- length(vars)
   
   # Join a heavy left dataset with itself, with an increasing number of
   # variables on the right
   # This fails when the right table has 9 variables
   lapply(7:nb_var, function(n) {
     print(paste0("Doing the join with ", n+1, " variables"))
     vars_temp <- c("id", vars[1:n])
     print(vars_temp)
     data_out <- data |> 
       left_join(
         data |>
           select(all_of(vars_temp)),
         by = c("id" = "id")
       ) |>
       compute()
     return("Success!")
   })
   
   # Join a lighter left dataset with itself, with an increasing number of
   # variables on the right
   # This fails again when the right table has 9 variables
   lapply(7:nb_var, function(n) {
     print(paste0("Doing the join with ", n+1, " variables"))
     vars_temp <- c("id", vars[1:n])
     print(vars_temp)
     data_out <- data |> select(id) |>
       left_join(
         data |>
           select(all_of(vars_temp)),
         by = c("id" = "id")
       ) |>
       compute()
     return("Success!")
   })
   
   # Join the dataset with itself, with an increasing number of variables on
   # the left and the full dataset on the right
   # This fails with only two variables on the left
   lapply(1:nb_var, function(n) {
     print(paste0("Doing the join with ", n+1, " variables"))
     vars_temp <- c("id", vars[1:n])
     print(vars_temp)
     data_out <- data |>
       select(all_of(vars_temp)) |> 
       left_join(
         data,
         by = c("id" = "id")
       ) |>
       compute()
     return("Success!")
   })
   ```


