hideaki opened a new pull request, #13415:
URL: https://github.com/apache/arrow/pull/13415
Fixes ARROW-16578 "[R] unique() and is.na() on a column of a tibble is much
slower after writing to and reading from a parquet file".
This PR materializes the AltrepVectorString on the first call to Elt.
I think this makes sense for two reasons: if R makes one call (e.g. from
unique()), more calls are likely to follow, and getting a string from the
Array seems to be much more costly than getting it from data2.
Something like a 3-strike rule (materializing only after a few calls) may
make sense too, but in this PR I'm taking the simpler approach.
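To illustrate the trade-off, here is a standalone sketch. It is not the code in this PR and does not use the R ALTREP or Arrow APIs. `PackedStrings`, `LazyStringVector`, and the `threshold` parameter are hypothetical names made up for the example. A threshold of 1 mimics the approach taken here, and a threshold of 3 mimics a 3-strike rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow string array: all characters live in
// one buffer, and offsets[i]..offsets[i+1] delimit element i.
struct PackedStrings {
  std::string bytes;
  std::vector<std::size_t> offsets;  // n + 1 entries for n strings

  std::size_t size() const { return offsets.size() - 1; }

  // Each access builds a fresh std::string (the "costly" path).
  std::string value(std::size_t i) const {
    return bytes.substr(offsets[i], offsets[i + 1] - offsets[i]);
  }
};

// Hypothetical lazy wrapper: element access goes to the packed source until
// the number of Elt calls reaches `threshold`, then everything is copied
// once into a plain vector (the analogue of materializing into data2).
class LazyStringVector {
 public:
  explicit LazyStringVector(PackedStrings source, int threshold = 1)
      : source_(std::move(source)), threshold_(threshold) {}

  std::string Elt(std::size_t i) {
    assert(i < source_.size());
    ++calls_;
    if (!materialized_ && calls_ >= threshold_) Materialize();
    return materialized_ ? cache_[i] : source_.value(i);
  }

  bool is_materialized() const { return materialized_; }

 private:
  void Materialize() {
    cache_.reserve(source_.size());
    for (std::size_t i = 0; i < source_.size(); ++i) {
      cache_.push_back(source_.value(i));
    }
    materialized_ = true;
  }

  PackedStrings source_;
  std::vector<std::string> cache_;
  int threshold_;
  int calls_ = 0;
  bool materialized_ = false;
};
```

With `threshold = 1` the first `Elt` call pays the full conversion cost and every later call is a cheap lookup. A larger threshold avoids that cost when only one or two elements are ever read.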
ARROW-16578 reprex with the fix:
```
> df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
> write_parquet(df1,"/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df2$x))
   user  system elapsed
  0.074   0.002   0.082
> system.time(unique(df1$x))
   user  system elapsed
  0.022   0.001   0.025
> system.time(is.na(df2$x))
   user  system elapsed
  0.006   0.001   0.006
> system.time(is.na(df1$x))
   user  system elapsed
  0.003   0.000   0.004
```
devtools::test() result:
```
[ FAIL 0 | WARN 0 | SKIP 30 | PASS 7271 ]
```