debrouwere opened a new issue, #48908:
URL: https://github.com/apache/arrow/issues/48908

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I've noticed that serializing an Arrow table (an `ArrowTabular` object) from R 
using `arrow::write_to_raw` can take about 10x as long as it took to first read 
the dataset in from disk (just a regular NVMe SSD).
   
   Not sure whether this counts as a bug report or a feature enhancement 
request, but in any case, this seems excessive and currently makes Arrow a 
no-go for inter-process communication in R, e.g. for parallel processing with 
the `mirai` package.
   
   Here's a minimal example:
   
   ```r
   library("arrow")
   library("dplyr")    # provides collect()
   library("profvis")
   
   data <- data.frame(i = rep(1:10, times=1e5))
   for (v in 1:100) {
     data[, paste0("v", v)] <- rnorm(1e6)
   }
   
   # make sure the output directory exists
   dir.create("sandbox", showWarnings = FALSE)
   # 790 MiB on disk
   write_parquet(data, "sandbox/random.parquet")
   file.info("sandbox/random.parquet")$size / 1024 / 1024
   
   profvis({
     query <- open_dataset("sandbox/random.parquet")
     atbl <- as_arrow_table(query)                       # 70 ms
     tbl <- collect(atbl)                                # 10 ms
     ser <- arrow::write_to_raw(atbl, format = "stream") # 810 ms
     # - as.raw.Buffer                                   # (660 ms)  
     # - write_ipc_stream                                # (120 ms)
     # - buffer                                          # (20 ms)
     des <- read_ipc_stream(ser, as_data_frame = FALSE)  # 10 ms
   })
   ```
   
   As you can see, it takes 80 ms (70+10) to read the data into R, but 810 ms 
to serialize it for IPC. `as.raw.Buffer` seems to be the major culprit, but 
even `write_ipc_stream` takes more time than a full read from disk.
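   
   To time the stages separately outside of profvis, something like the 
following rough sketch could be used. It assumes, based on the profile above, 
that `write_to_raw` amounts to a `write_ipc_stream` into a `BufferOutputStream` 
followed by `as.raw` on the resulting buffer. The small stand-in table and the 
variable names (`t_write`, `t_finish`, `t_copy`) are only for illustration; swap 
in the real table to reproduce the numbers. Whether skipping the raw-vector copy 
is an option for `mirai` is not something this sketch answers.
   
   ```r
   library("arrow")
   
   # Small stand-in table (assumption); the report uses 1e6 rows x 101 columns.
   atbl <- arrow_table(i = rep(1:10, times = 100), v1 = rnorm(1000))
   
   # Stage 1: write the IPC stream into an in-memory Arrow sink.
   sink <- BufferOutputStream$create()
   t_write <- system.time(write_ipc_stream(atbl, sink))
   
   # Stage 2: finalize the sink into an Arrow Buffer (still Arrow-owned memory).
   t_finish <- system.time(buf <- sink$finish())
   
   # Stage 3: copy the Arrow Buffer into an R raw vector.
   t_copy <- system.time(ser <- as.raw(buf))
   
   # Read back directly from the Buffer, without going through the raw vector.
   des <- read_ipc_stream(BufferReader$create(buf), as_data_frame = FALSE)
   
   # Elapsed seconds per stage.
   print(c(write = t_write[["elapsed"]],
           finish = t_finish[["elapsed"]],
           copy = t_copy[["elapsed"]]))
   ```
   
   If stage 3 dominates here as well, that would point at the `Buffer` to raw 
vector conversion rather than at the IPC writer itself.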
   
   I have observed this same behavior on a 2019 MacBook Air (macOS, Intel) as 
well as on a 2025 workstation (Linux, AMD Zen 5). The speed is also similar 
whether using an R arrow 22.0 binary or a compiled arrow 24, with the latter 
being maybe a tad faster (600-650 ms instead of 800-810 ms), but that could be 
noise in my benchmark.
   
   For completeness, here are the two package versions I tested with. The 
binary:
   
   ```
   Arrow package version: 22.0.0.1
   
   Capabilities:
                  
   acero      TRUE
   dataset    TRUE
   substrait FALSE
   parquet    TRUE
   json       TRUE
   s3         TRUE
   gcs        TRUE
   utf8proc   TRUE
   re2        TRUE
   snappy     TRUE
   gzip       TRUE
   brotli     TRUE
   zstd       TRUE
   lz4        TRUE
   lz4_frame  TRUE
   lzo       FALSE
   bz2        TRUE
   jemalloc   TRUE
   mimalloc   TRUE
   
   Memory:
                     
   Allocator mimalloc
   Current       5 Gb
   Max        5.76 Gb
   
   Runtime:
                             
   SIMD Level          avx512
   Detected SIMD Level avx512
   
   Build:
                              
   C++ Library Version  22.0.0
   C++ Compiler            GNU
   C++ Compiler Version  8.3.1
   ```
   
   ... and a version compiled using `install_arrow(nightly = TRUE)`:
   
   ```
   Arrow package version: 23.0.0.100000000
   
   Capabilities:
                  
   acero      TRUE
   dataset    TRUE
   substrait FALSE
   parquet    TRUE
   json       TRUE
   s3         TRUE
   gcs        TRUE
   utf8proc   TRUE
   re2        TRUE
   snappy     TRUE
   gzip       TRUE
   brotli     TRUE
   zstd       TRUE
   lz4        TRUE
   lz4_frame  TRUE
   lzo       FALSE
   bz2        TRUE
   jemalloc   TRUE
   mimalloc   TRUE
   
   Memory:
                     
   Allocator mimalloc
   Current    1.76 Gb
   Max        1.76 Gb
   
   Runtime:
                             
   SIMD Level          avx512
   Detected SIMD Level avx512
   
   Build:
                                       
   C++ Library Version  24.0.0-SNAPSHOT
   C++ Compiler                     GNU
   C++ Compiler Version          11.4.0
   ```
   
   As always, thanks for your help, I love arrow/parquet.
   
   ### Component(s)
   
   R

