debrouwere opened a new issue, #48908:
URL: https://github.com/apache/arrow/issues/48908
### Describe the bug, including details regarding any error messages, version, and platform.
I've noticed that serializing an arrow table (an ArrowTabular object) from R
using `arrow::write_to_raw` can take about 10x as long as it took to read the
dataset from disk in the first place (just a regular NVMe SSD).
Not sure whether this counts as a bug report or a feature enhancement
request, but in any case, this seems excessive and currently makes Arrow a
no-go for inter-process communication in R, e.g. for parallel processing with
the `mirai` package.
Here's a minimal example:
```r
library("arrow")
library("dplyr")    # provides the collect() generic
library("profvis")

# 1e6 rows: one integer column plus 100 columns of random doubles
data <- data.frame(i = rep(1:10, times = 1e5))
for (v in 1:100) {
  data[, paste0("v", v)] <- rnorm(1e6)
}

dir.create("sandbox", showWarnings = FALSE)
write_parquet(data, "sandbox/random.parquet")
file.info("sandbox/random.parquet")$size / 1024 / 1024  # 790 MiB on disk

profvis({
  query <- open_dataset("sandbox/random.parquet")
  atbl <- as_arrow_table(query)                        # 70 ms
  tbl <- collect(atbl)                                 # 10 ms
  ser <- arrow::write_to_raw(atbl, format = "stream")  # 810 ms
  #   - as.raw.Buffer                                  # (660 ms)
  #   - write_ipc_stream                               # (120 ms)
  #   - buffer                                         # (20 ms)
  des <- read_ipc_stream(ser, as_data_frame = FALSE)   # 10 ms
})
```
As you can see, it takes 80 ms (70+10) to read the data into R, but 810 ms
to serialize it for IPC. `as.raw.Buffer` seems to be the major culprit, but
even `write_ipc_stream` takes more time than a full read from disk.
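To separate the cost of the IPC encoding from the cost of copying into an R raw vector, the stages can be timed individually. The following is a minimal sketch, assuming `write_to_raw` amounts to roughly the three steps visible in the profile (write to an in-memory stream, obtain the buffer, convert with `as.raw`). The small table is a stand-in rather than the dataset from the example, and no timings are claimed for it.

```r
library(arrow)

# Small stand-in table (assumption: not the 790 MiB dataset from the report).
atbl <- arrow_table(i = rep(1:10, times = 1e4), v = rnorm(1e5))

# Stage 1: encode the table as an IPC stream into Arrow-managed memory.
sink <- BufferOutputStream$create()
t_write <- system.time(write_ipc_stream(atbl, sink))[["elapsed"]]

# Stage 2: finish the stream to obtain an arrow Buffer.
t_finish <- system.time(buf <- sink$finish())[["elapsed"]]

# Stage 3: copy the Buffer into an R raw vector (the as.raw.Buffer step).
t_raw <- system.time(ser <- as.raw(buf))[["elapsed"]]

# Read the stream back straight from the Buffer, without the raw vector.
des <- read_ipc_stream(BufferReader$create(buf), as_data_frame = FALSE)

print(c(write = t_write, finish = t_finish, as_raw = t_raw))
```

If the `as.raw` copy dominates here as well, keeping the data in an arrow `Buffer` avoids it within a single process, though a raw vector may still be needed to hand the bytes to another R process.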
I have observed this same behavior on a 2019 MacBook Air (macOS, Intel) as
well as on a 2025 workstation (Linux, AMD Zen 5). The speed is also similar
whether using an R arrow 22.0 binary or a compiled arrow 24, with the latter
being maybe a tad faster (600-650 ms instead of 800-810 ms), but that could be
noise in my benchmark.
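One way to tell a real difference from benchmark noise is to repeat the call several times and compare the spread. This is only a sketch on a small stand-in table, using base R's `system.time` rather than a dedicated benchmarking package. The repetition count of 5 is an arbitrary choice.

```r
library(arrow)

# Small stand-in table (assumption: not the dataset from the report).
atbl <- arrow_table(i = rep(1:10, times = 1e4), v = rnorm(1e5))

# Time the same serialization call repeatedly; the first run may include warm-up cost.
timings <- replicate(
  5,
  system.time(write_to_raw(atbl, format = "stream"))[["elapsed"]]
)

# Compare the typical value against the spread across runs.
print(median(timings))
print(range(timings))
```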
For completeness, here are the two package versions I tested with. The
binary:
```
Arrow package version: 22.0.0.1
Capabilities:
acero TRUE
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE
Memory:
Allocator mimalloc
Current 5 Gb
Max 5.76 Gb
Runtime:
SIMD Level avx512
Detected SIMD Level avx512
Build:
C++ Library Version 22.0.0
C++ Compiler GNU
C++ Compiler Version 8.3.1
```
... and a version compiled using `install_arrow(nightly = TRUE)`:
```
Arrow package version: 23.0.0.100000000
Capabilities:
acero TRUE
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE
Memory:
Allocator mimalloc
Current 1.76 Gb
Max 1.76 Gb
Runtime:
SIMD Level avx512
Detected SIMD Level avx512
Build:
C++ Library Version 24.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 11.4.0
```
As always, thanks for your help, I love arrow/parquet.
### Component(s)
R