[ https://issues.apache.org/jira/browse/ARROW-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557215#comment-17557215 ]
Hideaki Hayashi commented on ARROW-16578:
-----------------------------------------
Created a PR for a possible fix.
[https://github.com/apache/arrow/pull/13415]
It seems that each Elt call keeps working against the un-materialized array,
and when the call is repeated for every element of the array, as R's unique()
does, it ends up being expensive.
In the PR, I materialize the array on the first call to Elt, and at least in
this case, the result seems much better.
I also thought about something like a 3-strike rule, but took the simpler
approach here.
Can this be a valid solution?
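To make the trade-off concrete, here is a toy sketch in plain R. It is not the
arrow package's ALTREP code: make_lazy_column, elt, strikes, and the counters
are hypothetical names used only for illustration, and reading the "3-strike
rule" as "materialize on the third Elt call" is an assumption. With strikes = 1
the column is materialized on the first element access; with strikes = 3 the
first two accesses still take the slow per-element path.

```r
# Toy stand-in for a lazily decoded string column. All names here are
# hypothetical; this is not the arrow package's ALTREP implementation.
make_lazy_column <- function(codes, dictionary, strikes = 1L) {
  state <- new.env()
  state$materialized <- NULL  # plain character vector once built
  state$elt_calls <- 0L       # number of element accesses so far
  state$slow_lookups <- 0L    # accesses served by the slow per-element path

  elt <- function(i) {
    state$elt_calls <- state$elt_calls + 1L
    # Materialize once the number of accesses reaches the threshold.
    if (is.null(state$materialized) && state$elt_calls >= strikes) {
      state$materialized <- dictionary[codes]
    }
    if (!is.null(state$materialized)) {
      return(state$materialized[[i]])
    }
    # Slow path: decode a single element without caching anything.
    state$slow_lookups <- state$slow_lookups + 1L
    dictionary[[codes[[i]]]]
  }

  list(elt = elt, state = state)
}

codes <- c(1L, 2L, 1L, 3L, 2L)
dictionary <- c("a", "b", "c")

first_call <- make_lazy_column(codes, dictionary, strikes = 1L)
three_strikes <- make_lazy_column(codes, dictionary, strikes = 3L)

# Walk every element in order, as a unique()-style scan would.
values_first <- vapply(seq_along(codes), first_call$elt, character(1))
values_three <- vapply(seq_along(codes), three_strikes$elt, character(1))
```

Both variants return the same values; they differ only in how many accesses
go through the slow path before the column is materialized. A threshold would
mainly help callers that touch only one or two elements, at the cost of extra
bookkeeping.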
> [R] unique() and is.na() on a column of a tibble is much slower after writing
> to and reading from a parquet file
> ----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-16578
> URL: https://issues.apache.org/jira/browse/ARROW-16578
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, R
> Affects Versions: 7.0.0, 8.0.0
> Reporter: Hideaki Hayashi
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> unique() on a column of a tibble is much slower after writing to and reading
> from a parquet file.
> Here is a reprex.
> {{library(arrow)}}
> {{df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))}}
> {{write_parquet(df1,"/tmp/test.parquet")}}
> {{df2 <- read_parquet("/tmp/test.parquet")}}
> {{system.time(unique(df1$x))}}
> {{# Result on my late-2020 MacBook Pro with an M1 processor:}}
> {{# user system elapsed }}
> {{# 0.020 0.000 0.021 }}
> {{system.time(unique(df2$x))}}
> {{# user system elapsed }}
> {{# 5.230 0.419 5.649 }}
>
>