[https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530962#comment-17530962]
Carl Boettiger edited comment on ARROW-15081 at 5/2/22 10:29 PM:
-----------------------------------------------------------------
Thanks Weston, I'll try that. Just to make sure I'm testing the right thing: it
should suffice to test the nightly build, arrow-7.0.0.20220501.
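For reference, one way to get that nightly build in R is the arrow package's own
helper (a sketch; the exact repository configuration may vary by platform):
{code:java}
# Reinstall arrow from the nightly repository using the package's helper
arrow::install_arrow(nightly = TRUE)
packageVersion("arrow")  # should report something like 7.0.0.20220501
{code}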
With that version I still see high RAM use that leads to a crash (i.e. after it
exceeds the 50 GB of RAM I allocate to my container), which should be
reproducible with this example:
{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")

path <- arrow::s3_bucket("ebird/Mar-2022/observations",
                         endpoint_override = "minio.carlboettiger.info",
                         anonymous = TRUE)
obs <- arrow::open_dataset(path)
{code}
{code:java}
tmp <- obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm = TRUE),
            .groups = "drop")
tmp <- tmp |> compute()  # crashes
{code}
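While the query runs, Arrow's own allocations can be watched from R; a minimal
sketch, assuming the memory-pool accessors available in recent arrow releases:
{code:java}
library(arrow)

pool <- default_memory_pool()
pool$backend_name     # allocator in use, e.g. "jemalloc" or "mimalloc"
pool$bytes_allocated  # bytes currently held by Arrow's C++ allocator
pool$max_memory       # high-water mark, where the pool tracks it
{code}
Note this only reports Arrow's own allocations, so container-level RSS can still
be higher.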
> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
> Key: ARROW-15081
> URL: https://issues.apache.org/jira/browse/ARROW-15081
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Carl Boettiger
> Assignee: Weston Pace
> Priority: Major
>
> The following should reproduce the crash:
> {code:java}
> library(arrow)
> library(dplyr)
>
> server <- arrow::s3_bucket("ebird",
>                            endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
>
> path$ls()            # observe -- 1 parquet file
> obs %>% count()      # CRASH
> obs %>% to_duckdb()  # also crashes
> {code}
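> One low-memory way to confirm the file is readable at all is to stream record
> batches instead of materializing a table; a sketch, assuming the Scanner /
> RecordBatchReader API exposed by the arrow R package:
> {code:java}
> library(arrow)
>
> # Stream the dataset batch by batch; memory stays near one batch's size
> reader <- Scanner$create(obs)$ToRecordBatchReader()
> n <- 0
> batch <- reader$read_next_batch()
> while (!is.null(batch)) {
>   n <- n + batch$num_rows
>   batch <- reader$read_next_batch()
> }
> n  # total row count
> {code}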
> I have attempted to split this large (~100 GB) parquet file into smaller
> files, which helps:
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
>
> path$ls()        # observe -- multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files were also created by arrow, btw, from a single large csv
> file provided by the original data provider (eBird). Unfortunately, generating
> the partitioned versions is cumbersome: the data is very unevenly distributed,
> there are few columns that can partition it without creating thousands of
> parquet files, and even then the bulk of the ~1 billion rows falls within the
> same group. All the same, I think this is a bug, as there's no obvious reason
> arrow should not handle a single 100 GB parquet file.)
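> As a sketch of an easier repartitioning path: newer arrow releases (8.0.0+)
> add a max_rows_per_file argument to write_dataset(), which caps file sizes
> without needing an evenly distributed partition column. The output path and
> row cap below are illustrative assumptions:
> {code:java}
> library(arrow)
>
> obs <- open_dataset(server$path("Oct-2021/observations"))
> # Stream-write into files of at most ~10M rows each (hypothetical cap)
> write_dataset(obs, "observations-partitioned",
>               format = "parquet",
>               max_rows_per_file = 10000000)
> {code}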
>
> Let me know if I can provide more info! I'm testing in R with the latest CRAN
> version of arrow on a machine with 200 GB of RAM.
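> For completeness, the arrow package ships a helper that prints the build,
> version, allocator, and enabled-feature details useful alongside a report
> like this:
> {code:java}
> arrow::arrow_info()  # versions, compiler, allocator, enabled features
> {code}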