thisisnic opened a new issue, #40224:
URL: https://github.com/apache/arrow/issues/40224
### Describe the bug, including details regarding any error messages, version, and platform.
I'm trying to repartition a ~10 GB dataset based on a new variable, but I can't work out whether this is a bug or expected behaviour given how things are implemented internally. Here's the R code I've been running:
```r
open_dataset("data/pums/person") |>
  mutate(
    age_group = case_when(
      AGEP < 25 ~ "Under 25",
      AGEP < 35 ~ "25-34",
      AGEP < 45 ~ "35-44",
      AGEP < 55 ~ "45-54",
      AGEP < 65 ~ "55-64",
      TRUE ~ "65+"
    )
  ) |>
  write_dataset(
    path = "./data/pums/person-age-partitions",
    partitioning = c("year", "location", "age_group")
  )
```
The data is in Parquet format and is already partitioned by "year" and "location". When I run this, memory usage grows steadily until the process crashes.

If I run it with the debugger attached, everything looks fine, but the process eventually dies with the message `Program terminated with signal SIGKILL, Killed.`
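In case it's useful for narrowing this down: a workaround I've been considering is to process one existing partition at a time, so the writer never has to hold row groups for every output partition in memory at once. This is only a sketch assuming the existing hive-style `year`/`location` layout; the `years`/`locations` vectors below are placeholders for whatever values the dataset actually contains:

```r
library(arrow)
library(dplyr)

# Placeholder partition values -- enumerate these from the real dataset.
years <- c(2018, 2019)
locations <- c("CA", "NY")

for (y in years) {
  for (l in locations) {
    # Filtering on the partition columns prunes to a single input partition,
    # so only that slice is scanned and repartitioned by age_group.
    open_dataset("data/pums/person") |>
      filter(year == y, location == l) |>
      mutate(
        age_group = case_when(
          AGEP < 25 ~ "Under 25",
          AGEP < 35 ~ "25-34",
          AGEP < 45 ~ "35-44",
          AGEP < 55 ~ "45-54",
          AGEP < 65 ~ "55-64",
          TRUE ~ "65+"
        )
      ) |>
      write_dataset(
        path = "./data/pums/person-age-partitions",
        partitioning = c("year", "location", "age_group"),
        existing_data_behavior = "overwrite"
      )
  }
}
```

This keeps memory bounded per iteration, but obviously it shouldn't be necessary if the single `write_dataset()` call is expected to stream.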
### Component(s)
C++
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]