Carl Boettiger created ARROW-15081:
--------------------------------------
Summary: Arrow crashes (OOM) on R client with large remote parquet files
Key: ARROW-15081
URL: https://issues.apache.org/jira/browse/ARROW-15081
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Carl Boettiger
The following should reproduce the crash:
{code:java}
library(arrow)
library(dplyr)
server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
path <- server$path("Oct-2021/observations")
obs <- arrow::open_dataset(path)
path$ls() # observe -- 1 parquet file
obs %>% count() # CRASH
obs %>% to_duckdb() # also crash
{code}
I have attempted to split this large (~100 GB) parquet file into smaller files, which helps:
{code:java}
path <- server$path("partitioned")
obs <- arrow::open_dataset(path)
path$ls() # observe -- multiple parquet files now
obs %>% count()
{code}
(These parquet files were also created by arrow, by the way, from a single large CSV file provided by the original data provider, eBird. Unfortunately, generating the partitioned versions is cumbersome: the data is very unevenly distributed, few columns can be partitioned on without creating thousands of parquet files, and even then the bulk of the ~1 billion rows fall within the same group. All the same, I think this is a bug, since there is no obvious reason why arrow should not be able to handle a single 100 GB parquet file.)
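For reference, the partitioned copy was generated with roughly the following kind of workflow (the file name and partition column below are placeholders, not the real eBird schema), which is where the skew problem shows up: most candidate partition columns either explode into thousands of tiny files or leave nearly all rows in one group.
{code:java}
library(arrow)

# Hypothetical local path and partition column -- not the actual eBird schema.
csv_path <- "ebird-observations.csv"
out_dir  <- "partitioned"

# Open the CSV lazily as a dataset (it never has to fit in memory at once)
# and rewrite it as a partitioned parquet dataset.
ebird <- open_dataset(csv_path, format = "csv")
write_dataset(ebird, out_dir,
              format = "parquet",
              partitioning = "country_code") # assumed column; heavily skewed in practice
{code}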
Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB of RAM.