[ 
https://issues.apache.org/jira/browse/ARROW-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618147#comment-17618147
 ] 

Neal Richardson commented on ARROW-17559:
-----------------------------------------

IIRC the R nyc-taxi benchmarks started failing due to the null column not being 
excluded from the projection anymore, so we should see the benchmarks 
succeeding again.

> [R][C++] Regression: big performance hit after removing schema binding
> ----------------------------------------------------------------------
>
>                 Key: ARROW-17559
>                 URL: https://issues.apache.org/jira/browse/ARROW-17559
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 9.0.0
>         Environment: ubuntu 2020
>            Reporter: Vitalie Spinu
>            Priority: Major
>              Labels: R, compute
>             Fix For: 10.0.0
>
>
> After ARROW-15260 I observe big increases in memory use and compute time with 
> basic summarise queries. My use case shows almost 10x memory and 10x 
> computation time increases in some cases.  
> Here is a less dramatic reproduction, modeled on my real use case, which gives a 2x 
> time increase:
> {code:R}
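>   library(arrow)
>   library(dplyr)  # %>%, group_by(), summarise(), n(), collect(), bind_cols()
>   library(glue)   # glue()
>   library(purrr)  # map()
>
>   # Write 100 hive-style partitions (day=1..100) with 10 parquet files each.
>   dir.create(dir <- "/tmp/iris", showWarnings = FALSE)
>   for (day in seq_len(100)) {
>     dir.create(glue("{dir}/day={day}"), showWarnings = FALSE)
>     for (i in seq_len(10)) {
>       # Widen iris to 100 columns: 20 copies with suffixed column names.
>       dfs <- map(seq_len(20), function(j) {
>         names(iris) <- paste0(names(iris), j)
>         iris
>       })
>       df <- dplyr::bind_cols(!!!dfs)
>       write_parquet(df, glue("{dir}/day={day}/{i}.parquet"))
>     }
>   }
>
>   # Time a grouped count that references only two columns (day, Species1).
>   system.time(
>     open_dataset("/tmp/iris") %>%
>       group_by(day, Species1) %>%
>       summarise(N = n(), .groups = "drop") %>%
>       collect())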
> {code}
> Before commit 838687178: 0.33 sec; after: 0.73 sec. 
> If I put back the schema binding that was removed 
> [here|https://github.com/apache/arrow/pull/12826/files#diff-0d1ff6f17f571f6a348848af7de9c05ed588d3339f46dd3bcf2808489f7dca92L235]
>  I get the performance back. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
