Thank you Tomas. For the record, I'm confirming that the stray background R worker process now times out properly after 'setup_timeout' (= 120) seconds:

  {0s}$ Rscript -e 'parallel::makeCluster(1L, port=80)'
  Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
    cannot open the connection
  Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
  In addition: Warning message:
  In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
    port 80 cannot be opened
  Execution halted

  {1s}$ ps aux | grep -E "exec[/]R"
  hb 17645 2.0 0.3 259104 55144 pts/5 S 20:58 0:00 /home/hb/software/R-devel/trunk/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK() --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE

  {2s}$ sleep 120

  {122s}$ ps aux | grep -E "exec[/]R"
  {122s}$

Good spotting of the bug:

  - if (Sys.time() - t0 > setup_timeout) break
  + if (difftime(Sys.time(), t0, units="secs") > setup_timeout) break

For those who find this thread, I think what's going on here is that
'setup_timeout = 120' is a plain numeric that is compared to a
'difftime' whose unit keeps changing as time goes by. The comparison
uses only the numeric value of the 'difftime' and ignores its unit.
When compared as 'Sys.time() - t0 > setup_timeout', the LHS is in
units of seconds as long as less than 60 seconds have passed:

  > Sys.time() - t0
  Time difference of 59 secs
  > as.numeric(Sys.time() - t0)
  [1] 59

However, as soon as more than 60 seconds have passed, the unit turns
into minutes and we're comparing minutes to seconds:

  > Sys.time() - t0
  Time difference of 1.016667 mins
  > as.numeric(Sys.time() - t0)
  [1] 1.016667

which is now compared to 'setup_timeout'. If the unit remained
minutes, it would time out after 120 [minutes]. However, after 120
minutes, the unit of 'Sys.time() - t0' is hours, and we're comparing
hours to seconds, and so on. It would only time out if we used
'setup_timeout' < 60 seconds.
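
The unit switching described above can be reproduced without waiting,
by using fixed timestamps instead of Sys.time(). This is only a
minimal sketch for illustration, not the code in snow.R; the names
'elapsed' and 'res' and the chosen elapsed times are made up here:

```r
# Illustration only: fixed timestamps make the result reproducible.
t0 <- as.POSIXct("2019-03-27 12:00:00", tz = "UTC")
setup_timeout <- 120  # seconds, as in SETUPTIMEOUT=120

# Hypothetical elapsed times (in seconds) at which the timeout is checked
elapsed <- c(59, 61, 121, 3601, 7201)

res <- do.call(rbind, lapply(elapsed, function(s) {
  now <- t0 + s
  d_auto <- now - t0                           # unit is chosen automatically
  d_secs <- difftime(now, t0, units = "secs")  # unit is pinned to seconds
  data.frame(
    elapsed_secs = s,
    auto_value   = as.numeric(d_auto),
    auto_units   = units(d_auto),
    old_check    = d_auto > setup_timeout,     # check before the fix
    new_check    = d_secs > setup_timeout      # check after the fix
  )
}))
print(res)
```

With these inputs, 'old_check' should be FALSE in every row, because
the numeric value of the automatically scaled 'difftime' never
exceeds 120, whereas 'new_check' turns TRUE once more than 120
seconds have elapsed.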

/Henrik

On Wed, Mar 27, 2019 at 12:52 PM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
>
> The problem causing the stray worker processes when the master fails to
> open a server socket to listen to connections from workers is not
> related to timeout in socketConnection(), because socketConnection()
> will fail right away. It is caused by a bug in checking the setup
> timeout (PR 17391).
>
> Fixed in 76275.
>
> Best
> Tomas
>
> On 3/18/19 2:23 AM, Henrik Bengtsson wrote:
> > (Bcc: CRAN)
> >
> > This is a proposal helping CRAN and alike as well as individual
> > developers to avoid stray R processes being left behind that might be
> > produced when an example or a package test fails to set up a
> > parallel::makeCluster().
> >
> >
> > ISSUE
> >
> > If a package test sets up a PSOCK cluster and then the master process
> > dies for one reason or the other, the PSOCK worker processes will
> > remain running for 30 days ('timeout') until they timeout and
> > terminate that way. When this happens on CRAN servers, where many
> > packages are checked all the time, this will result in a lot of stray
> > R processes.
> >
> > Here is an example illustrating how R leaves behind stray R processes
> > if fails to establish a connection to one or more background R
> > processes launched by 'parallel::makeCluster()'. First, let's make
> > sure there are no other R processes running:
> >
> > $ ps aux | grep -E "exec[/]R"
> >
> > Then, lets create a PSOCK cluster for which connection will fail
> > (because port 80 is reserved):
> >
> > $ Rscript -e 'parallel::makeCluster(1L, port=80)'
> > Error in socketConnection("localhost", port = port, server = TRUE,
> > blocking = TRUE, :
> > cannot open the connection
> > Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode ->
> > socketConnection
> > In addition: Warning message:
> > In socketConnection("localhost", port = port, server = TRUE,
> > blocking = TRUE, :
> > port 80 cannot be opened
> >
> > The launched R worker is still running:
> >
> > $ ps aux | grep -E "exec[/]R"
> > hb 20778 37.0 0.4 283092 70624 pts/0 S 17:50 0:00
> > /usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK()
> > --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120
> > TIMEOUT=2592000 XDR=TRUE
> >
> > This process will keep running for 'TIMEOUT=2592000' seconds (= 30
> > days). The reason for this is that it is currently in the state where
> > it attempts to set up a connection to the main R process:
> >
> > > parallel:::.slaveRSOCK
> > function ()
> > {
> >     makeSOCKmaster <- function(master, port, setup_timeout, timeout,
> >         useXDR) {
> >         ...
> >         repeat {
> >             con <- tryCatch({
> >                 socketConnection(master, port = port, blocking = TRUE,
> >                     open = "a+b", timeout = timeout)
> >             }, error = identity)
> >             ...
> > }
> >
> > In other words, it is stuck in 'socketConnection()' and it won't time
> > out until 'timeout' seconds.
> >
> >
> > SUGGESTION
> >
> > To mitigate the problem with above stray processes from running 'R CMD
> > check', we could shorten the 'timeout' which is currently hardcoded to
> > 30 days (src/library/parallel/R/snow.R). By making it possible to
> > control the default via environment variables, e.g.
> >
> > setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60
> > * 2)), # 2 minutes
> > timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 60
> > * 24 * 30)), # 30 days
> >
> > it would be straightforward to adjust `R CMD check` to use, say,
> >
> > R_PARALLEL_SETUP_TIMEOUT=60
> >
> > by default. This would cause any stray processes to time out after 60
> > seconds (instead of 30 days as now).
> >
> > /Henrik
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel