[https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530962#comment-17530962]
Carl Boettiger edited comment on ARROW-15081 at 5/2/22 10:29 PM:
-----------------------------------------------------------------
Thanks Weston, I'll try that. Just to make sure I'm testing the right thing: it
should suffice to test the nightly build, arrow-7.0.0.20220501.
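For reference, one way to get that nightly build in R is the arrow package's own
helper (a sketch; the exact repository configuration may vary by platform):
{code:java}
# Reinstall arrow from the nightly repository using the package's helper
arrow::install_arrow(nightly = TRUE)
packageVersion("arrow")  # should report something like 7.0.0.20220501
{code}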
With that version I still see high RAM use that leads to a crash (i.e. after it
exceeds the 50 GB of RAM I allocate to my container), which should be
reproducible with this example:
{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")

path <- arrow::s3_bucket("ebird/Mar-2022/observations",
                         endpoint_override = "minio.carlboettiger.info",
                         anonymous = TRUE)
obs <- arrow::open_dataset(path)
{code}
{code:java}
tmp <- obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm = TRUE),
            .groups = "drop")
tmp <- tmp |> compute()  # crashes
{code}
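While the query runs, Arrow's own allocations can be watched from R; a minimal
sketch, assuming the memory-pool accessors available in recent arrow releases:
{code:java}
library(arrow)

pool <- default_memory_pool()
pool$backend_name     # allocator in use, e.g. "jemalloc" or "mimalloc"
pool$bytes_allocated  # bytes currently held by Arrow's C++ allocator
pool$max_memory       # high-water mark, where the pool tracks it
{code}
Note this only reports Arrow's own allocations, so container-level RSS can still
be higher.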
> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
> Key: ARROW-15081
> URL: https://issues.apache.org/jira/browse/ARROW-15081
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Carl Boettiger
> Assignee: Weston Pace
> Priority: Major
>
> The following should reproduce the crash:
> {code:java}
> library(arrow)
> library(dplyr)
>
> server <- arrow::s3_bucket("ebird",
>                            endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
>
> path$ls()            # observe -- 1 parquet file
> obs %>% count()      # CRASH
> obs %>% to_duckdb()  # also crashes
> {code}
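> One low-memory way to confirm the file is readable at all is to stream record
> batches instead of materializing a table; a sketch, assuming the Scanner /
> RecordBatchReader API exposed by the arrow R package:
> {code:java}
> library(arrow)
>
> # Stream the dataset batch by batch; memory stays near one batch's size
> reader <- Scanner$create(obs)$ToRecordBatchReader()
> n <- 0
> batch <- reader$read_next_batch()
> while (!is.null(batch)) {
>   n <- n + batch$num_rows
>   batch <- reader$read_next_batch()
> }
> n  # total row count
> {code}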
> I have attempted to split this large (~100 GB) parquet file into smaller
> files, which helps:
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
>
> path$ls()        # observe -- multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files were also created by arrow, btw, from a single large csv
> file provided by the original data provider (eBird). Unfortunately, generating
> the partitioned versions is cumbersome: the data is very unevenly distributed,
> there are few columns that can partition it without creating thousands of
> parquet files, and even then the bulk of the ~1 billion rows falls within the
> same group. All the same, I think this is a bug, as there's no obvious reason
> arrow should not handle a single 100 GB parquet file.)
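> As a sketch of an easier repartitioning path: newer arrow releases (8.0.0+)
> add a max_rows_per_file argument to write_dataset(), which caps file sizes
> without needing an evenly distributed partition column. The output path and
> row cap below are illustrative assumptions:
> {code:java}
> library(arrow)
>
> obs <- open_dataset(server$path("Oct-2021/observations"))
> # Stream-write into files of at most ~10M rows each (hypothetical cap)
> write_dataset(obs, "observations-partitioned",
>               format = "parquet",
>               max_rows_per_file = 10000000)
> {code}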
>
> Let me know if I can provide more info! I'm testing in R with the latest CRAN
> version of arrow on a machine with 200 GB of RAM.
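> For completeness, the arrow package ships a helper that prints the build,
> version, allocator, and enabled-feature details useful alongside a report
> like this:
> {code:java}
> arrow::arrow_info()  # versions, compiler, allocator, enabled features
> {code}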