[GitHub] [arrow] egillax opened a new issue, #35334: [R] join with numeric columns has poor performance

via GitHub Tue, 25 Apr 2023 08:23:15 -0700


egillax opened a new issue, #35334:
URL: https://github.com/apache/arrow/issues/35334


   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When joining and the columns are `numeric` the performance scales really 
badly with size. I was noticing a huge speed difference when joining columns on 
`numeric` columns vs `int32/int64` columns when by mistake my columns were 
`numeric`. This is not the case for `dplyr`.
   
   I made the following figure by creating arrow tables of increasing sizes and 
joined them with another table a third of the size (see bottom for  code):
   
   
![image](https://user-images.githubusercontent.com/24678081/234317155-76ddf6c5-793e-4bd7-a704-1336a3aec57c.png)
   
   As can be seen when join columns are `numeric` it scales much worse than the 
other ones. I'm not even sure it makes sense to join on numerics... but I think 
this could be a bug. Or at least something worth investigating.
   
   <details>
     <summary>Code to reproduce</summary>
   
     ```R
   sizes <- c(1e3, 1e4, 1e5, 1e6)
   results <- data.frame()
   for (size in sizes) {
   size1 <- size
   size2 <- floor(size/3)
   
   # int 32 id column
   dfInt1 <- data.frame(id=sample(1:size1, replace=F),
                     value=runif(size1))
   dfInt2 <- data.frame(id=sample(1:size2, replace=F),
                     value2=runif(size2))
   arrowInt1 <- arrow::as_arrow_table(dfInt1)
   arrowInt2 <- arrow::as_arrow_table(dfInt2)
   
   # int 64 id column
   type <- bit64::as.integer64
   dfInt64_1 <- data.frame(id=type(sample(1:size1, replace=F)),
                           value=runif(size1))
   dfInt64_2 <- data.frame(id=type(sample(1:size2, replace=F)),
                           value2=runif(size2))
   arrowInt64_1 <- arrow::as_arrow_table(dfInt64_1)
   arrowInt64_2 <- arrow::as_arrow_table(dfInt64_2)
   
   # numeric id column
   type <- as.numeric
   dfNum1 <- data.frame(id=type(sample(1:size1, replace=F)),
                           value=runif(size1))
   dfNum2 <- data.frame(id=type(sample(1:size2, replace=F)),
                           value2=runif(size2))
   arrowNum1 <- arrow::as_arrow_table(dfNum1)
   arrowNum2 <- arrow::as_arrow_table(dfNum2)
   
   arrowResultsInt <- arrowInt1 |> dplyr::inner_join(arrowInt2, by='id') |> 
dplyr::compute()
   arrowResultsInt64 <- arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, 
by='id') |> dplyr::compute()
   arrowResultsNum <- arrowNum1 |> dplyr::inner_join(arrowNum2, by='id') |> 
dplyr::compute()
   
   res <- microbenchmark::microbenchmark(arrowInt32=arrowInt1 |> 
dplyr::inner_join(arrowInt2, by='id') |> dplyr::compute(),
     arrowInt64=arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by='id') |> 
dplyr::compute(),
     arrowNum=arrowNum1 |> dplyr::inner_join(arrowNum2, by='id') |> 
dplyr::compute(),
     dplyrNum=dfNum1 |> dplyr::inner_join(dfNum2, by='id'),
     dplyrInt=dfInt1 |> dplyr::inner_join(dfInt2, by='id'),
     check = NULL, times=10
   )
   res <- summary(res)
   res$size <- size
   
   results <- rbind(res, results)
   
   }
   
   ggplot2::ggplot(data=results, ggplot2::aes(x=size, y=mean, group=expr, 
color=expr)) + 
     ggplot2::geom_line() + ylab('mean time (ms)')
     ```
   </details>
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] egillax opened a new issue, #35334: [R] join with numeric columns has poor performance

Reply via email to