thisisnic commented on issue #48057:
URL: https://github.com/apache/arrow/issues/48057#issuecomment-3498066335
Thanks for those profvis outputs @joeramirez, that's super handy, and that's
the best reprex I've been sent in a while!
Yeah, this is a bug; the biggest slowdowns are where we're dealing with
schemas and calling `apply_arrow_r_metadata()`. That function has logic that
checks whether `r_metadata` is NULL, but when the original object is a plain
data.frame we end up with metadata attached from the columns, so the function
loops through it, treating it as useful metadata.
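If anyone wants to poke at this, here's a rough sketch of how you can compare the metadata arrow attaches to a plain data.frame versus a tibble. This assumes the `arrow_table()` constructor and the `$metadata` accessor from the arrow R package; the exact contents you see will depend on your arrow version, so treat it as a starting point rather than a definitive repro:

```r
library(arrow)
library(tibble)

df_raw <- data.frame(x = 1:3, y = letters[1:3])
df_tbl <- as_tibble(df_raw)

# The "r" entry (when present) holds the serialized R metadata that
# apply_arrow_r_metadata() later walks through on the read side.
names(arrow_table(df_raw)$metadata)
names(arrow_table(df_tbl)$metadata)
```

Comparing the two outputs should show whether the plain data.frame is carrying extra per-column metadata that the tibble version avoids.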
As a workaround for now, if you call `df <- as_tibble(df)` on the data.frame
before writing it to Parquet, things speed up. When I did that I got much
better results:
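To make the workaround concrete, here's a minimal sketch, assuming `write_parquet()` from the arrow package and a throwaway data.frame (the `df` and file path here are illustrative, not from the original report):

```r
library(arrow)
library(tibble)

df <- data.frame(x = rnorm(1e4), y = sample(letters, 1e4, replace = TRUE))

# Workaround: convert to a tibble first so the per-column r_metadata
# scan in apply_arrow_r_metadata() has less to chew through.
df <- as_tibble(df)

path <- tempfile(fileext = ".parquet")
write_parquet(df, path)
```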
```
> microbenchmark::microbenchmark(fn_arrow(), fn_nano(), times = 20)
Unit: milliseconds
       expr       min       lq     mean   median       uq      max neval
 fn_arrow() 137.75471 143.7422 160.1568 148.1037 160.1099 340.6319    20
  fn_nano()  65.45475 102.4707 216.1656 265.8206 282.6650 324.4787    20
```
We should fix this anyway, though, as we'd like times to be comparable when
the original data is a plain data.frame too.