egillax opened a new issue, #35334: URL: https://github.com/apache/arrow/issues/35334
### Describe the bug, including details regarding any error messages, version, and platform. When joining and the columns are `numeric` the performance scales really badly with size. I was noticing a huge speed difference when joining columns on `numeric` columns vs `int32/int64` columns when by mistake my columns were `numeric`. This is not the case for `dplyr`. I made the following figure by creating arrow tables of increasing sizes and joined them with another table a third of the size (see bottom for code):  As can be seen when join columns are `numeric` it scales much worse than the other ones. I'm not even sure it makes sense to join on numerics... but I think this could be a bug. Or at least something worth investigating. <details> <summary>Code to reproduce</summary> ```R sizes <- c(1e3, 1e4, 1e5, 1e6) results <- data.frame() for (size in sizes) { size1 <- size size2 <- floor(size/3) # int 32 id column dfInt1 <- data.frame(id=sample(1:size1, replace=F), value=runif(size1)) dfInt2 <- data.frame(id=sample(1:size2, replace=F), value2=runif(size2)) arrowInt1 <- arrow::as_arrow_table(dfInt1) arrowInt2 <- arrow::as_arrow_table(dfInt2) # int 64 id column type <- bit64::as.integer64 dfInt64_1 <- data.frame(id=type(sample(1:size1, replace=F)), value=runif(size1)) dfInt64_2 <- data.frame(id=type(sample(1:size2, replace=F)), value2=runif(size2)) arrowInt64_1 <- arrow::as_arrow_table(dfInt64_1) arrowInt64_2 <- arrow::as_arrow_table(dfInt64_2) # numeric id column type <- as.numeric dfNum1 <- data.frame(id=type(sample(1:size1, replace=F)), value=runif(size1)) dfNum2 <- data.frame(id=type(sample(1:size2, replace=F)), value2=runif(size2)) arrowNum1 <- arrow::as_arrow_table(dfNum1) arrowNum2 <- arrow::as_arrow_table(dfNum2) arrowResultsInt <- arrowInt1 |> dplyr::inner_join(arrowInt2, by='id') |> dplyr::compute() arrowResultsInt64 <- arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by='id') |> dplyr::compute() arrowResultsNum <- arrowNum1 |> dplyr::inner_join(arrowNum2, by='id') |> dplyr::compute() res <- microbenchmark::microbenchmark(arrowInt32=arrowInt1 |> dplyr::inner_join(arrowInt2, by='id') |> dplyr::compute(), arrowInt64=arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by='id') |> dplyr::compute(), arrowNum=arrowNum1 |> dplyr::inner_join(arrowNum2, by='id') |> dplyr::compute(), dplyrNum=dfNum1 |> dplyr::inner_join(dfNum2, by='id'), dplyrInt=dfInt1 |> dplyr::inner_join(dfInt2, by='id'), check = NULL, times=10 ) res <- summary(res) res$size <- size results <- rbind(res, results) } ggplot2::ggplot(data=results, ggplot2::aes(x=size, y=mean, group=expr, color=expr)) + ggplot2::geom_line() + ylab('mean time (ms)') ``` </details> ### Component(s) R -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
