thisisnic opened a new issue, #40224:
URL: https://github.com/apache/arrow/issues/40224
### Describe the bug, including details regarding any error messages, version, and platform.
I'm trying to repartition a ~10 GB dataset based on a new variable, but I can't work out whether this is a bug or expected behaviour given how things are implemented internally. Here's the R code I've been running:
```r
open_dataset("data/pums/person") |>
  mutate(
    age_group = case_when(
      AGEP < 25 ~ "Under 25",
      AGEP < 35 ~ "25-34",
      AGEP < 45 ~ "35-44",
      AGEP < 55 ~ "45-54",
      AGEP < 65 ~ "55-64",
      TRUE ~ "65+"
    )
  ) |>
  write_dataset(
    path = "./data/pums/person-age-partitions",
    partitioning = c("year", "location", "age_group")
  )
```
The data is in Parquet format and is already partitioned by "year" and "location". When I run this, memory usage grows steadily until the process crashes.

If I run it with the debugger attached, everything looks fine, but the process eventually dies with the message `Program terminated with signal SIGKILL, Killed.`
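In case it's useful for narrowing this down: a workaround I've been considering is to process one existing partition at a time, so the writer never has to hold row groups for every output partition in memory at once. This is only a sketch assuming the existing hive-style `year`/`location` layout; the `years`/`locations` vectors below are placeholders for whatever values the dataset actually contains:

```r
library(arrow)
library(dplyr)

# Placeholder partition values -- enumerate these from the real dataset.
years <- c(2018, 2019)
locations <- c("CA", "NY")

for (y in years) {
  for (l in locations) {
    # Filtering on the partition columns prunes to a single input partition,
    # so only that slice is scanned and repartitioned by age_group.
    open_dataset("data/pums/person") |>
      filter(year == y, location == l) |>
      mutate(
        age_group = case_when(
          AGEP < 25 ~ "Under 25",
          AGEP < 35 ~ "25-34",
          AGEP < 45 ~ "35-44",
          AGEP < 55 ~ "45-54",
          AGEP < 65 ~ "55-64",
          TRUE ~ "65+"
        )
      ) |>
      write_dataset(
        path = "./data/pums/person-age-partitions",
        partitioning = c("year", "location", "age_group"),
        existing_data_behavior = "overwrite"
      )
  }
}
```

This keeps memory bounded per iteration, but obviously it shouldn't be necessary if the single `write_dataset()` call is expected to stream.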
### Component(s)
C++
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]