Hi R users.
Apologies for the lack of concrete examples because the dataset is large, and
it being so I believe is the issue.
I multiple, very large datasets for which I need to generate 0/1
absence/presence columns
Some include over 200M rows, with two columns that need presence/absence
columns based on the strings contained within them, as an example, one set has
~29k unique values and the other with ~15k unique values (no overlap across the
two).
Using a combination of custom functions:
crewjanitormakeclean <- function(df,columns) {
df <- df |> mutate(across(columns, ~make_clean_names(., allow_dupes = TRUE)))
return(df)
}
mass_pivot_wider <- function(df,column,prefix) {
df <- df |> distinct() |> mutate(n = 1) |> pivot_wider(names_from =
glue("{column}"), values_from = n, names_prefix = prefix, values_fill = list(n
= 0))
return(df)
}
sum_group_function <- function(df) {
df <- df |> group_by(ID_Key) |>
summarise(across(c(starts_with("column1_name_"),starts_with("column2_name_"),),
~ sum(.x, na.rm = TRUE))) |> ungroup()
return(df)
}
and splitting up the data into a list of 110k individual dataframes based on
Key_ID
temp <-
open_dataset(
sources = input_files,
format = 'csv',
unify_schema = TRUE,
col_types = schema(
"ID_Key" = string(),
"column1" = string(),
"column1" = string()
)
) |> as_tibble()
keeptabs <- split(temp, temp$ID_Key)
I used a multicore framework to distribute the `sum` functions across each
Key_ID when a multicore argument is enabled.
if(isTRUE(multicore)){
output <- mclapply(1:length(modtabs), function(i)
crewjanitormakeclean(modtabs[[i]],c("string_columns_2","string_columns_1")),
mc.cores = numcores)
output <- mclapply(1:length(modtabs), function(i)
mass_pivot_wider(modtabs[[i]],"string_columns_1","col_1_name_"), mc.cores =
numcores)
output <- mclapply(1:length(modtabs), function(i)
mass_pivot_wider(modtabs[[i]],"string_columns_2","col_2_name_"), mc.cores =
numcores)
}else{
output <- lapply(1:length(modtabs), function(i)
crewjanitormakeclean(modtabs[[i]],c("string_columns_2","string_columns_1")))
output <- lapply(1:length(modtabs), function(i)
mass_pivot_wider(modtabs[[i]],"string_columns_1","col_1_name_"))
output <- mclapply(1:length(modtabs), function(i)
mass_pivot_wider(modtabs[[i]],"string_columns_2","col_2_name_"))
}
Moving every Key_ID to a single row and then row-binding the data while
creating new columns for the differences across `Key_ID`s from the pivot using
the following solution (78 upvotes at time of this email):
https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns
allNms <- unique(unlist(lapply(keeptabs, names)))
output <- do.call(rbind,
c(lapply(keeptabs,
function(x) data.frame(c((x),
sapply(setdiff(allNms, names(x)),
function(y)
NA))) |> as_tibble()),
make.row.names=FALSE)) |>
mutate(across(c(starts_with("column1_name_"), starts_with("column2_name_")),
coalesce, 0))
However, I have noticed that the jobs seem to "hang" after a while, with the
initial 30 or so (numcores == 30 in the workflow, equal to the number of cores
I reserved) at 100% of the requested CPU, and then several "zombie" processes
occur and the cores just stop at 0% and never proceed, usually dying with a
timeout or not all jobs running to completion failure to join of some kind.
This happens in both base R and RStudio, and I haven't been able to figure out
if it's something wrong with the code, the size of the data, or our
architecture, but I would appreciate any suggestions as to what I might be able
to do about this.
Before they are suggested, I have also tried this same approach with foreach,
snow, future, and furr packages, and base parallel with mc.apply seems to be
the only thing that works for at least one dataset.
In the event it has something to do with our architecture, here is what we are
running on and our loaded packages:
OS Information:
NAME="Red Hat Enterprise Linux"
VERSION="9.3 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
Operating System: Red Hat Enterprise Linux 9.3 (Plow)
CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
Architecture: x86-64
Hardware Vendor: Dell Inc.
Hardware Model: PowerEdge R840
Firmware Version: 2.15.1
R Version:
R.Version()
$platform
[1] "x86_64-pc-linux-gnu"
$arch
[1] "x86_64"
$os
[1] "linux-gnu"
$system
[1] "x86_64, linux-gnu"
$status
[1] ""
$major
[1] "4"
$minor
[1] "3.2"
$year
[1] "2023"
$month
[1] "10"
$day
[1] "31"
$`svn rev`
[1] "85441"
$language
[1] "R"
$version.string
[1] "R version 4.3.2 (2023-10-31)"
$nickname
[1] "Eye Holes"
RStudio Server Version:
RStudio 2023.09.1+494 "Desert Sunflower" Release
(cd7011dce393115d3a7c3db799dda4b1c7e88711, 2023-10-16) for RHEL 9
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/130.0.0.0 Safari/537.36
PPM Repo:
https://packagemanager.posit.co/cran/__linux__/rhel9/latest
attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base
other attached packages:
[1] listenv_0.9.1 microbenchmark_1.5.0 dbplyr_2.4.0
duckplyr_0.4.1 readxl_1.4.3 fastDummies_1.7.3
[7] glue_1.8.0 arrow_14.0.2.1 data.table_1.15.2
toolbox_0.1.1 janitor_2.2.0 lubridate_1.9.3
[13] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
readr_2.1.5 tidyr_1.3.1
[19] tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0
duckdb_1.1.2 DBI_1.2.3 fs_1.6.3
Happy to provide additional information if it would be helpful.
Thank you in advance!
The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Mass General Brigham
Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline
<https://www.massgeneralbrigham.org/complianceline> .
Please note that this e-mail is not secure (encrypted). If you do not wish to
continue communication over unencrypted e-mail, please notify the sender of
this message immediately. Continuing to send or respond to e-mail after
receiving this message means you understand and accept this risk and wish to
continue to communicate over unencrypted e-mail.
[[alternative HTML version deleted]]
______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.