hideaki opened a new pull request, #13415:
URL: https://github.com/apache/arrow/pull/13415

   Fixes ARROW-16578 "[R] unique() and is.na() on a column of a tibble is much 
slower after writing to and reading from a parquet file".
   
   Here I'm materializing the AltrepVectorString on the first call to Elt.
   My reasoning is that one call from R (e.g. from unique()) is likely to be 
followed by more, and that getting a string from the Array seems to be much 
more costly than getting it from data2.
   Something like a three-strike rule may make sense too, but in this PR I'm 
taking the simpler approach.
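
To illustrate the idea only, here is a minimal standalone sketch of materialize-on-first-access. It is not the code in this PR and does not use the R ALTREP API or Arrow: `SlowStringSource`, `LazyStringVector`, and the `materialize_after` parameter are hypothetical names made up for the example. `SlowStringSource` stands in for the costly per-element lookup in the Array, and the cache stands in for data2. With `materialize_after = 1` the vector is materialized on the first `Elt` call, as in this PR; a value of 3 would approximate the three-strike variant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive backing store: each Get() call
// represents a costly per-element lookup, so we count how many happen.
class SlowStringSource {
 public:
  explicit SlowStringSource(std::vector<std::string> values)
      : values_(std::move(values)) {}

  std::size_t size() const { return values_.size(); }

  std::string Get(std::size_t i) const {
    assert(i < values_.size());
    ++slow_reads_;
    return values_[i];
  }

  std::size_t slow_reads() const { return slow_reads_; }

 private:
  std::vector<std::string> values_;
  mutable std::size_t slow_reads_ = 0;
};

// Hypothetical lazy vector: copies every element into a cheap cache once
// Elt() has been called `materialize_after` times, then serves from the cache.
class LazyStringVector {
 public:
  explicit LazyStringVector(const SlowStringSource& source,
                            std::size_t materialize_after = 1)
      : source_(source), materialize_after_(materialize_after) {}

  bool is_materialized() const { return materialized_; }

  std::string Elt(std::size_t i) {
    assert(i < source_.size());
    if (!materialized_ && ++elt_calls_ >= materialize_after_) {
      Materialize();
    }
    // After materialization, reads no longer touch the slow source.
    return materialized_ ? cache_[i] : source_.Get(i);
  }

 private:
  void Materialize() {
    cache_.reserve(source_.size());
    for (std::size_t i = 0; i < source_.size(); ++i) {
      cache_.push_back(source_.Get(i));
    }
    materialized_ = true;
  }

  const SlowStringSource& source_;  // must outlive this object
  std::size_t materialize_after_;
  std::size_t elt_calls_ = 0;
  std::vector<std::string> cache_;
  bool materialized_ = false;
};
```

In this sketch the first `Elt` call pays for one full pass over the source, and every later call is a plain vector read. The trade-off is the extra memory for the cache, which is wasted if only a single element is ever requested.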
   
   ARROW-16578 reprex with the fix:
   ```
   > df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
   > write_parquet(df1,"/tmp/test.parquet")
   > df2 <- read_parquet("/tmp/test.parquet")
   > system.time(unique(df2$x))
      user  system elapsed 
     0.074   0.002   0.082 
   > system.time(unique(df1$x))
      user  system elapsed 
     0.022   0.001   0.025 
   > system.time(is.na(df2$x))
      user  system elapsed 
     0.006   0.001   0.006 
   > system.time(is.na(df1$x))
      user  system elapsed 
     0.003   0.000   0.004 
   ```
   
   devtools::test() result:
   ```
   [ FAIL 0 | WARN 0 | SKIP 30 | PASS 7271 ]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

