thisisnic commented on issue #48057:
URL: https://github.com/apache/arrow/issues/48057#issuecomment-3498066335
Thanks for those profvis outputs @joeramirez, that's super handy, and that's
the best reprex I've been sent in a while!
Yeah, this is a bug; the biggest slowdowns are where we're dealing with
schemas and calling `apply_arrow_r_metadata()`. That function has logic that
checks whether `r_metadata` is NULL, but when the original object is a plain
data.frame we end up with metadata attached from the columns, so the function
loops through it, treating it as useful metadata.
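If anyone wants to poke at this, here's a rough sketch of how you can compare the metadata arrow attaches to a plain data.frame versus a tibble. This assumes the `arrow_table()` constructor and the `$metadata` accessor from the arrow R package; the exact contents you see will depend on your arrow version, so treat it as a starting point rather than a definitive repro:

```r
library(arrow)
library(tibble)

df_raw <- data.frame(x = 1:3, y = letters[1:3])
df_tbl <- as_tibble(df_raw)

# The "r" entry (when present) holds the serialized R metadata that
# apply_arrow_r_metadata() later walks through on the read side.
names(arrow_table(df_raw)$metadata)
names(arrow_table(df_tbl)$metadata)
```

Comparing the two outputs should show whether the plain data.frame is carrying extra per-column metadata that the tibble version avoids.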
As a workaround for now, if you call `df <- as_tibble(df)` on the data.frame
before writing it to Parquet, things speed up. When I did that I got much
better results:
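To make the workaround concrete, here's a minimal sketch, assuming `write_parquet()` from the arrow package and a throwaway data.frame (the `df` and file path here are illustrative, not from the original report):

```r
library(arrow)
library(tibble)

df <- data.frame(x = rnorm(1e4), y = sample(letters, 1e4, replace = TRUE))

# Workaround: convert to a tibble first so the per-column r_metadata
# scan in apply_arrow_r_metadata() has less to chew through.
df <- as_tibble(df)

path <- tempfile(fileext = ".parquet")
write_parquet(df, path)
```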
```
> microbenchmark::microbenchmark(fn_arrow(), fn_nano(), times = 20)
Unit: milliseconds
       expr       min       lq     mean   median       uq      max neval
 fn_arrow() 137.75471 143.7422 160.1568 148.1037 160.1099 340.6319    20
  fn_nano()  65.45475 102.4707 216.1656 265.8206 282.6650 324.4787    20
```
We should fix this anyway, though, as we'd like times to be comparable when
the original data is a plain data.frame too.