Carl Boettiger created ARROW-15081:
--------------------------------------

             Summary: Arrow crashes (OOM) on R client with large remote parquet files
                 Key: ARROW-15081
                 URL: https://issues.apache.org/jira/browse/ARROW-15081
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Carl Boettiger


The following should reproduce the crash:


{code:java}
library(arrow)
library(dplyr)
server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")

path <- server$path("Oct-2021/observations")
obs <- arrow::open_dataset(path)

path$ls() # observe -- 1 parquet file

obs %>% count() # CRASH

obs %>% to_duckdb() # also crash
{code}
I have attempted to split this large (~100 GB) parquet file into smaller files, which helps: 


{code:java}
path <- server$path("partitioned")
obs <- arrow::open_dataset(path)
path$ls() # observe, multiple parquet files now
obs %>% count()
{code}
(These parquet files were also created by arrow, by the way, from a single large CSV file supplied by the original data provider, eBird.  Unfortunately, generating the partitioned version is cumbersome because the data is very unevenly distributed: there are few columns that can be partitioned on without creating thousands of parquet files, and even then the bulk of the ~1 billion rows falls within a single group.  All the same, I think this is a bug, since there is no obvious reason arrow should not be able to handle a single 100 GB parquet file.)
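
Roughly, the partitioning step looked something like the sketch below (the CSV path and the partitioning column are placeholders, not the actual eBird column names):

{code:java}
library(arrow)
library(dplyr)

# Open the original single large CSV lazily as a dataset (path is a placeholder)
csv <- arrow::open_dataset("observations.csv", format = "csv")

# Re-write it as parquet, partitioned on a hypothetical grouping column; with
# data this unevenly distributed, one partition still ends up holding most rows
csv %>%
  arrow::write_dataset("partitioned", format = "parquet",
                       partitioning = "partition_col")
{code}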

 

Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB of RAM.
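
Happy to attach full environment details as well, e.g. (sketch only; output omitted):

{code:java}
# Arrow build/capability details and R session information
arrow::arrow_info()
sessionInfo()
{code}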


