[
https://issues.apache.org/jira/browse/ARROW-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537052#comment-17537052
]
Jonathan Keane commented on ARROW-16578:
----------------------------------------
Thanks for the report + very clear reprex.
I see what's going on here, and it isn't quite what I was expecting. What's
happening is that arrow uses ALTREP when converting Arrow tables to R (which
happens when one reads from parquet like this). Because of that, the time for
the {{unique()}} call on the column here includes the time it takes to
translate from Arrow's representation to R's (which is moderately expensive for
strings, as you can see here!).
But something still isn't quite right here, because I would expect subsequent
calls on the same column to be much faster (basically the same time as the
call against {{df1$x}} there).
{code:r}
library(arrow)
df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
write_parquet(df1,"/tmp/test.parquet")
df2 <- read_parquet("/tmp/test.parquet")
system.time(unique(df2$x))
#> user system elapsed
#> 13.285 3.854 17.284
{code}
And we can check that this column is still ALTREP, even though I would have
expected it to be materialized at this point (since the {{unique()}} call above
should have caused that):
{code:r}
arrow:::is_arrow_altrep(df2$x)
#> [1] TRUE
col_altrep <- df2$x
arrow:::is_arrow_altrep(col_altrep)
#> [1] TRUE
system.time(unique(col_altrep))
#> user system elapsed
#> 12.150 2.760 15.003
{code}
But if we do fully materialize it, we see a much, much faster {{unique()}}:
{code:r}
col_not_altrep <- df2$x[1:nrow(df2)]
arrow:::is_arrow_altrep(col_not_altrep)
#> [1] FALSE
system.time(unique(col_not_altrep))
#> user system elapsed
#> 0.011 0.002 0.013
{code}
TL;DR: we do expect the first call to {{unique()}} in this circumstance to be
slower (because we shift the cost of materializing the data from the
parquet file from the read step to the first call that requires
materialization). But we've got something else going wrong, because we aren't
keeping the materialized data around like we should be.
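As a stopgap until this is fixed, the full-range subsetting trick above can be
wrapped in a small helper. This is only a sketch: {{materialize_chr()}} is a
made-up name, not an arrow function, and it assumes that subsetting with the
full index keeps returning a plain (non-ALTREP) vector, as it did in the output
above.
{code:r}
# Hypothetical helper (not part of arrow): copy each character column into an
# ordinary R vector by subsetting it with its full index range.
materialize_chr <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.character(col)) {
      df[[nm]] <- col[seq_along(col)]
    }
  }
  df
}

# Plain data frame stand-in so the sketch runs without arrow installed.
df <- data.frame(x = c("a", "b", "a"), n = 1:3, stringsAsFactors = FALSE)
out <- materialize_chr(df)
unique(out$x)

# Intended use (assumes the arrow package and the file from the reprex):
# df2 <- materialize_chr(arrow::read_parquet("/tmp/test.parquet"))
{code}
This pays the conversion cost once, right after reading, so later calls such as
{{unique()}} or {{is.na()}} should run against an ordinary character vector.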
> [R] unique() and is.na() on a column of a tibble is much slower after writing
> to and reading from a parquet file
> ----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-16578
> URL: https://issues.apache.org/jira/browse/ARROW-16578
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, R
> Affects Versions: 7.0.0, 8.0.0
> Reporter: Hideaki Hayashi
> Priority: Major
>
> unique() on a column of a tibble is much slower after writing to and reading
> from a parquet file.
> Here is a reprex.
> {{df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))}}
> {{write_parquet(df1,"/tmp/test.parquet")}}
> {{df2 <- read_parquet("/tmp/test.parquet")}}
> {{system.time(unique(df1$x))}}
> {{# Result on my late 2020 macbook pro with M1 processor:}}
> {{# user system elapsed }}
> {{# 0.020 0.000 0.021 }}
> {{system.time(unique(df2$x))}}
> {{# user system elapsed }}
> {{# 5.230 0.419 5.649 }}
>
>