[ 
https://issues.apache.org/jira/browse/ARROW-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557215#comment-17557215
 ] 

Hideaki Hayashi commented on ARROW-16578:
-----------------------------------------

Created a PR for a possible fix.

[https://github.com/apache/arrow/pull/13415]

It seems that each Elt call keeps working on the un-materialized array, so when 
it is repeated for every element of the array, as R's unique() does, it ends 
up being expensive.

Here I'm materializing the array at the first call to Elt, and at least in this 
case, the result seems much better.

I also thought about something like a 3-strike rule, but took the simple 
approach here.
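
To make the idea concrete, here is a minimal, self-contained C++ sketch of 
"materialize on the first Elt call", with an optional threshold that mimics 
the N-strike variant. It is not the code in the PR and does not use Arrow's or 
R's ALTREP API: SlowSource, LazyStringVector, and the threshold parameter are 
hypothetical names made up for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an un-materialized array whose per-element
// access is expensive. This is NOT Arrow's API.
class SlowSource {
 public:
  explicit SlowSource(std::vector<std::string> values)
      : values_(std::move(values)) {}

  std::size_t size() const { return values_.size(); }

  // Simulates a costly lookup and counts how often it is performed.
  std::string Get(std::size_t i) const {
    ++lookups_;
    return values_[i];
  }

  std::size_t lookups() const { return lookups_; }

 private:
  std::vector<std::string> values_;
  mutable std::size_t lookups_ = 0;
};

// Hypothetical wrapper: after `threshold` calls to Elt() it copies the whole
// source into a plain vector and serves all later calls from that copy.
// threshold = 1 is the simple approach (materialize at the first call);
// threshold = 3 would be a 3-strike rule.
class LazyStringVector {
 public:
  explicit LazyStringVector(const SlowSource& source, std::size_t threshold = 1)
      : source_(source), threshold_(threshold) {}

  std::string Elt(std::size_t i) {
    assert(i < source_.size());
    if (!materialized_) {
      if (++elt_calls_ < threshold_) {
        return source_.Get(i);  // still on the slow path
      }
      Materialize();
    }
    return cache_[i];  // fast path: plain vector access
  }

  bool is_materialized() const { return materialized_; }

 private:
  void Materialize() {
    cache_.reserve(source_.size());
    for (std::size_t i = 0; i < source_.size(); ++i) {
      cache_.push_back(source_.Get(i));
    }
    materialized_ = true;
  }

  const SlowSource& source_;  // must outlive this wrapper
  std::size_t threshold_;
  std::size_t elt_calls_ = 0;
  std::vector<std::string> cache_;
  bool materialized_ = false;
};
```

In this sketch, with the default threshold of 1, the slow source is read once 
per element in total, no matter how many times Elt is called afterwards. The 
trade-off is that a caller who only needs one or two elements still pays for 
the full copy, which is what a higher threshold would avoid.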

Would this be a valid solution?

> [R] unique() and is.na() on a column of a tibble is much slower after writing 
> to and reading from a parquet file
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16578
>                 URL: https://issues.apache.org/jira/browse/ARROW-16578
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, R
>    Affects Versions: 7.0.0, 8.0.0
>            Reporter: Hideaki Hayashi
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> unique() on a column of a tibble is much slower after writing to and reading 
> from a parquet file.
> Here is a reprex.
> {{library(arrow)}}
> {{df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))}}
> {{write_parquet(df1,"/tmp/test.parquet")}}
> {{df2 <- read_parquet("/tmp/test.parquet")}}
> {{system.time(unique(df1$x))}}
> {{# Result on my late 2020 macbook pro with M1 processor:}}
> {{#   user  system elapsed }}
> {{#  0.020   0.000   0.021 }}
> {{system.time(unique(df2$x))}}
> {{#   user  system elapsed }}
> {{#  5.230   0.419   5.649 }}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
