[ 
https://issues.apache.org/jira/browse/ARROW-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537052#comment-17537052
 ] 

Jonathan Keane commented on ARROW-16578:
----------------------------------------

Thanks for the report + very clear reprex. 

I see what's going on here, and it isn't quite what I was expecting. arrow uses 
ALTREP when converting Arrow tables to R (which happens when one reads from 
Parquet like this). Because of that, the time you measure for {{unique()}} on 
the column includes the time it takes to translate from Arrow's representation 
to R's (which is moderately expensive for strings, as you can see here!). 

But something still isn't quite right here, because I would expect subsequent 
calls on the same column to be much faster (basically the same time as the 
call against {{df1$x}} in your reprex).

{code:r}
library(arrow)

df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
write_parquet(df1,"/tmp/test.parquet")
df2 <- read_parquet("/tmp/test.parquet")

system.time(unique(df2$x))
#>    user  system elapsed 
#>  13.285   3.854  17.284
{code}

And we can check that this column is indeed still ALTREP, even though I would 
have expected it to be materialized at this point (since the {{unique()}} call 
above should have forced that):
{code:r}
arrow:::is_arrow_altrep(df2$x)
#> [1] TRUE

col_altrep <- df2$x

arrow:::is_arrow_altrep(col_altrep)
#> [1] TRUE

system.time(unique(col_altrep))
#>    user  system elapsed 
#>  12.150   2.760  15.003
{code}
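
To check whether repeated calls on the same object get faster, it can help to 
time several consecutive evaluations side by side. The following is only a 
minimal sketch in base R; {{time_repeated()}} is a hypothetical helper name, 
not part of arrow, and the example vector is an ordinary R character vector 
rather than one read from Parquet.

```r
# Hypothetical helper (not part of arrow): evaluate f() `n` times in a row
# and return the elapsed wall-clock time of each evaluation, in seconds.
time_repeated <- function(f, n = 3L) {
  vapply(
    seq_len(n),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
}

# Example with a plain R character vector shaped like the one in the reprex.
x <- as.character(floor(runif(1e5) * 20))
time_repeated(function() unique(x))
```

If materialization were being retained, one would expect only the first of the 
timings for {{unique(df2$x)}} to be large.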

But if we do fully materialize it, we see a much, much faster {{unique()}}:

{code:r}
col_not_altrep <- df2$x[1:nrow(df2)]

arrow:::is_arrow_altrep(col_not_altrep)
#> [1] FALSE

system.time(unique(col_not_altrep))
#>    user  system elapsed 
#>   0.011   0.002   0.013
{code}
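
The full-range subset above can be applied to every column of a data frame as 
a stopgap. This is only a sketch: {{materialize_columns()}} is a hypothetical 
helper name, and it assumes that a full-range subset yields a regular R vector 
for every column type, which was only observed for the single string column 
above.

```r
# Hypothetical helper (not part of arrow): replace each column with a
# full-range subset of itself, mirroring `df2$x[1:nrow(df2)]`.
# Assumption: subsetting returns an ordinary (non-ALTREP) R vector.
materialize_columns <- function(df) {
  idx <- seq_len(nrow(df))
  # `df[] <-` keeps the data frame's class and attributes intact.
  df[] <- lapply(df, function(col) col[idx])
  df
}

# Example on a plain data frame; the values are unchanged by the copy.
d <- data.frame(x = c("a", "b", "a"), y = c(1, 2, 3), stringsAsFactors = FALSE)
m <- materialize_columns(d)
unique(m$x)

# With the reprex one would call: df2 <- materialize_columns(df2)
```

This pays the conversion cost once, up front, instead of on every call.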

TL;DR: we do expect the first call to {{unique()}} in this circumstance to be 
slower (because we shift the cost of materializing the data from the Parquet 
file from the read step to the first call that requires materialization). But 
something else is going wrong, because we aren't keeping the materialized data 
around like we should be.

> [R] unique() and is.na() on a column of a tibble is much slower after writing 
> to and reading from a parquet file
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16578
>                 URL: https://issues.apache.org/jira/browse/ARROW-16578
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, R
>    Affects Versions: 7.0.0, 8.0.0
>            Reporter: Hideaki Hayashi
>            Priority: Major
>
> unique() on a column of a tibble is much slower after writing to and reading 
> from a parquet file.
> Here is a reprex.
> {{df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))}}
> {{write_parquet(df1,"/tmp/test.parquet")}}
> {{df2 <- read_parquet("/tmp/test.parquet")}}
> {{system.time(unique(df1$x))}}
> {{# Result on my late 2020 macbook pro with M1 processor:}}
> {{#   user  system elapsed }}
> {{#  0.020   0.000   0.021 }}
> {{system.time(unique(df2$x))}}
> {{#   user  system elapsed }}
> {{#  5.230   0.419   5.649 }}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
