[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458660#comment-17458660 ]
Weston Pace commented on ARROW-15081:
-------------------------------------
I agree this should work. I'll have to look at how count is implemented: I
believe we shouldn't even need to read the data in that case, and I thought
we had some special paths in place for this.
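
For context, a Parquet file records its row count in the footer metadata, so
a count can in principle be answered without decoding any column data. A
minimal sketch, assuming a small local file as a stand-in for the remote one
(the file and its columns are made up for illustration, and this is not
necessarily how the dataset count path is implemented):

```r
library(arrow)

# Hypothetical stand-in data: a small local Parquet file instead of the
# ~100 GB remote file from the report.
tmp <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = 1:1000, value = rnorm(1000)), tmp)

# Open the file; the row count and row-group count come from the footer
# metadata, so no column data has to be decoded to obtain them.
reader <- ParquetFileReader$create(tmp)
n_rows <- reader$num_rows
n_row_groups <- reader$num_row_groups

print(n_rows)
```

If the dataset-level count does not take a metadata-only route, it would have
to scan the file, which could explain the memory use reported below.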
> Arrow crashes (OOM) on R client with large remote parquet files
> ---------------------------------------------------------------
>
> Key: ARROW-15081
> URL: https://issues.apache.org/jira/browse/ARROW-15081
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Carl Boettiger
> Assignee: Weston Pace
> Priority: Major
>
> The following should reproduce the crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird",endpoint_override =
> "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash
> {code}
> I have attempted to split this large (~100 GB) parquet file into some
> smaller files, which helps:
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe, multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files were also created by arrow, from a single large csv
> file provided by the original data provider, eBird. Unfortunately,
> generating the partitioned versions is cumbersome because the data is very
> unevenly distributed: few columns can be used for partitioning without
> creating 1000s of parquet files, and even then the bulk of the 1 billion
> rows fall within the same group. All the same, I think this is a bug, as
> there is no indication why arrow cannot handle a single 100 GB parquet
> file.)
>
> Let me know if I can provide more info! I'm testing in R with the latest
> CRAN version of arrow on a machine with 200 GB RAM.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)