djvanderlaan opened a new issue, #46428:
URL: https://github.com/apache/arrow/issues/46428
### Describe the bug, including details regarding any error messages, version, and platform.
I noticed that some operations are substantially slower and use more memory
under arrow v20.0.0 and v19.0.0 than under v17.0.0. I managed to reduce the
example and can reproduce this both on a production machine running
Ubuntu 22.04 and on my home desktop (running Debian stable).
On my desktop, the v17 run of the example took about 10s and peaked at
approx. 7GB of memory. The v20 run was killed after 1m16s because it ran out of
memory (my home machine is unfortunately limited to 24GB). Before being killed,
its memory use peaked at approx. 22GB. See below for the output.
The following code generates the example data:
```r
nvert <- 10E6  # 10 million vertices
nedge <- 20E7  # 200 million edges

# Vertex table: one row per vertex id
vert <- data.frame(
  id = seq_len(nvert)
)

# Edge table: random (src, dst) pairs plus an edge type
edges <- data.frame(
  src = sample(nvert, nedge, replace = TRUE),
  dst = sample(nvert, nedge, replace = TRUE),
  type = sample(1:10, nedge, replace = TRUE)
)

library(arrow)
write_parquet(vert, "vertices.parquet")
write_parquet(edges, "edges.parquet")
```
The following script processes this data and shows substantial differences
between v20 and v17 (the production machine had v19, which showed the same
behaviour as v20).
```r
library(arrow)
library(dplyr)
sessionInfo()
vert <- read_parquet("vertices.parquet")
str(vert)
con <- open_dataset("edges.parquet")
con
dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
nrow(dta)
```
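Note that in this example every `src` and `dst` value is sampled from `1:nvert`, and `vert$id` contains all of `1:nvert`, so the filter is expected to keep every row. The following is a minimal base R sketch of that reasoning on a scaled-down data set; the smaller sizes and the seed are assumptions chosen for illustration and are not part of the original example.

```r
# Scaled-down version of the example data (sizes and seed are illustrative)
set.seed(1)
nvert <- 1000
nedge <- 20000

vert <- data.frame(id = seq_len(nvert))
edges <- data.frame(
  src = sample(nvert, nedge, replace = TRUE),
  dst = sample(nvert, nedge, replace = TRUE),
  type = sample(1:10, nedge, replace = TRUE)
)

# Same predicate as in the arrow filter(), evaluated in plain R
keep <- edges$src %in% vert$id & edges$dst %in% vert$id
kept <- edges[keep, ]

# Every edge passes the filter, so no rows are dropped
nrow(kept) == nedge
```

So the size of the result equals the size of the input, and the cost of the query comes from evaluating the two `%in%` conditions and materialising all rows.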
Below the results for v17:
```
$ /usr/bin/time -v R --no-save < test.R
R version 4.5.0 (2025-04-11) -- "How About a Twenty-Six"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
>
> library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
>
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0 LAPACK version
3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.4 arrow_17.0.0
loaded via a namespace (and not attached):
[1] assertthat_0.2.1 R6_2.6.1 bit_4.6.0 tidyselect_1.2.1
[5] magrittr_2.0.3 glue_1.8.0 tibble_3.2.1 pkgconfig_2.0.3
[9] bit64_4.6.0-1 generics_0.1.3 lifecycle_1.0.4 cli_3.6.5
[13] vctrs_0.6.5 compiler_4.5.0 purrr_1.0.4 pillar_1.10.2
[17] rlang_1.1.6
>
> vert <- read_parquet("vertices.parquet")
> str(vert)
tibble [10,000,000 × 1] (S3: tbl_df/tbl/data.frame)
$ id: int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
>
> con <- open_dataset("edges.parquet")
> con
FileSystemDataset with 1 Parquet file
3 columns
src: int32
dst: int32
type: int32
See $metadata for additional Schema metadata
>
> dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
>
> nrow(dta)
[1] 200000000
>
>
Command being timed: "R --no-save"
User time (seconds): 30.19
System time (seconds): 2.74
Percent of CPU this job got: 306%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.75
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7078692
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 443
Minor (reclaiming a frame) page faults: 61785
Voluntary context switches: 10143
Involuntary context switches: 10536
Swaps: 0
File system inputs: 3898808
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
Below the results for v20:
```
$ /usr/bin/time -v R --no-save < test.R
R version 4.5.0 (2025-04-11) -- "How About a Twenty-Six"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
>
> library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
>
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0 LAPACK version
3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.4 arrow_20.0.0
loaded via a namespace (and not attached):
[1] assertthat_0.2.1 R6_2.6.1 bit_4.6.0 tidyselect_1.2.1
[5] magrittr_2.0.3 glue_1.8.0 tibble_3.2.1 pkgconfig_2.0.3
[9] bit64_4.6.0-1 generics_0.1.3 lifecycle_1.0.4 cli_3.6.5
[13] vctrs_0.6.5 compiler_4.5.0 purrr_1.0.4 pillar_1.10.2
[17] rlang_1.1.6
>
> vert <- read_parquet("vertices.parquet")
> str(vert)
tibble [10,000,000 × 1] (S3: tbl_df/tbl/data.frame)
$ id: int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
>
> con <- open_dataset("edges.parquet")
> con
FileSystemDataset with 1 Parquet file
3 columns
src: int32
dst: int32
type: int32
See $metadata for additional Schema metadata
>
> dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
Command terminated by signal 9
Command being timed: "R --no-save"
User time (seconds): 50.57
System time (seconds): 8.59
Percent of CPU this job got: 78%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15.57
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 21869392
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 622
Minor (reclaiming a frame) page faults: 2185479
Voluntary context switches: 2431
Involuntary context switches: 1599
Swaps: 0
File system inputs: 140744
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
So the run with v17 took about 10s and peaked at approx. 7GB of memory. The v20
run was killed after 1m16s because it ran out of memory (my home machine is
unfortunately limited to 24GB). Before being killed, its memory use peaked at
approx. 22GB.
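For reference, the approximate figures above can be derived from the `/usr/bin/time -v` output as sketched below. This assumes that the "kbytes" reported for the maximum resident set size are units of 1024 bytes; the variable names are only illustrative.

```r
# Maximum resident set size (kbytes) as reported by /usr/bin/time -v
rss_kb <- c(v17 = 7078692, v20 = 21869392)

# Convert to GB, assuming 1 kbyte = 1024 bytes
rss_gb <- rss_kb * 1024 / 1e9
round(rss_gb, 1)

# Elapsed wall clock time in seconds (0:10.75 and 1:15.57)
elapsed_s <- c(v17 = 10.75, v20 = 60 + 15.57)

# Ratios of v20 to v17; the v20 run was killed, so these are lower bounds
rss_ratio <- rss_kb[["v20"]] / rss_kb[["v17"]]
time_ratio <- elapsed_s[["v20"]] / elapsed_s[["v17"]]
```

Because the v20 run did not finish, both ratios understate the actual difference.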
### Component(s)
R