I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to socketConnection() as a character, i.e. there's a missing timeout <- as.integer(timeout), cf. port <- as.integer(port) and useXDR <- as.logical(value):
> parallel:::.slaveRSOCK
function ()
{
    makeSOCKmaster <- function(master, port, timeout, useXDR) {
        port <- as.integer(port)
        con <- socketConnection(master, port = port, blocking = TRUE,
            open = "a+b", timeout = timeout)
        structure(list(con = con), class = if (useXDR)
            "SOCKnode"
        else "SOCK0node")
    }
    master <- "localhost"
    port <- NA_integer_
    outfile <- Sys.getenv("R_SNOW_OUTFILE")
    methods <- TRUE
    useXDR <- TRUE
    for (a in commandArgs(TRUE)) {
        pos <- regexpr("=", a)
        name <- substr(a, 1L, pos - 1L)
        value <- substr(a, pos + 1L, nchar(a))
        switch(name, MASTER = {
            master <- value
        }, PORT = {
            port <- value
        }, OUT = {
            outfile <- value
        }, TIMEOUT = {
            timeout <- value
        }, XDR = {
            useXDR <- as.logical(value)
        })
    }
    if (is.na(port))
        stop("PORT must be specified")
    sinkWorkerOutput(outfile)
    msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
        paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
    cat(msg)
    slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
}
<bytecode: 0x4bd4b58>
<environment: namespace:parallel>

Yet, fixing that does *not* seem to change anything.

/Henrik

On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson <henrik.bengts...@gmail.com> wrote:
> BACKGROUND:
>
> While troubleshooting random, occasionally occurring, errors from
> parallel::makePSOCKcluster("localhost", port = 11000);
>
> Error in socketConnection("localhost", port = port, server = TRUE,
> blocking = TRUE, :
> cannot open the connection
>
> I had another look at parallel:::newPSOCKnode(), which is used
> internally to set up each background worker. It is designed to first
> launch the background worker as:
>
> system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
> wait = FALSE)
>
> which immediately tries to connect to a socket on localhost:11000 with
> timeout.
> Immediately after the master has launched the above (without
> waiting), it sets up the connection and waits for the background
> worker to connect:
>
> con <- socketConnection("localhost", port = 11000, server = TRUE,
> blocking = TRUE, open = "a+b", timeout = timeout)
>
>
> ISSUE:
>
> If we emulate the above process, and remove the OUT=/dev/null such
> that we can see the output produced by the worker, as:
>
> setup <- function(delay = 0) {
>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>   Sys.sleep(delay)
>   socketConnection("localhost", port = 11000, server = TRUE,
>                    blocking = TRUE, open = "a+b", timeout = 20)
> }
>
> doing:
>
>> con <- setup(0)
> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>
> will most likely work, but adding a delay:
>
>> con <- setup(5)
> starting worker pid=25099 on localhost:11000 at 18:45:23.617
> Warning in socketConnection(master, port = port, blocking = TRUE, open
> = "a+b", :
> localhost:11000 cannot be opened
> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",
> :
> cannot open the connection
> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
> socketConnection
>
> will produce an *instant* error on the worker, before the master opens
> the server socket. Eventually, the master will produce the originally
> observed error:
>
> Error in socketConnection("localhost", port = 11000, server = TRUE,
> blocking = TRUE, :
> cannot open the connection
>
> In other words, if the master fails to set up socketConnection()
> *before* the background worker attempts to connect, it all fails.
> Such a delay may happen, for instance, when there is a large CPU load
> on the test machine.
>
> Is this the above bug?
>
> /Henrik
>
> PS. The background is that I, very occasionally, observe R CMD check
> errors on the above (on CRAN and elsewhere) when testing my future
> package.
> The error always goes away when retested. Thus far I've thought
> this is due to port clashes (since the port is randomly selected in
> [11000:11999]) and accepted that it happens. However, after
> discovering the above, it could be due to the worker launching "too
> soon".
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.4 LTS
>
> Matrix products: default
> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
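
For reference, the coercion mentioned at the top of this message could look roughly like the sketch below. It only illustrates parsing the NAME=value worker arguments with every value converted to the type socketConnection() expects. The helper name parse_worker_args() and the 'opts' list are made up for this example and are not part of 'parallel'; as noted above, this coercion alone does not appear to resolve the connection failure.

```r
# Hypothetical helper (not part of 'parallel'): parse NAME=value worker
# arguments and coerce each value to the type it is later used as.
parse_worker_args <- function(args) {
  opts <- list(master = "localhost", port = NA_integer_,
               timeout = NA_integer_, useXDR = TRUE)
  for (a in args) {
    pos <- regexpr("=", a, fixed = TRUE)
    name <- substr(a, 1L, pos - 1L)
    value <- substr(a, pos + 1L, nchar(a))
    switch(name,
      MASTER  = { opts$master  <- value },
      PORT    = { opts$port    <- as.integer(value) },
      # The coercion that is missing in .slaveRSOCK():
      TIMEOUT = { opts$timeout <- as.integer(value) },
      XDR     = { opts$useXDR  <- as.logical(value) })
  }
  if (is.na(opts$port)) stop("PORT must be specified")
  opts
}

# Same arguments as in the system() call quoted above (minus OUT=).
opts <- parse_worker_args(c("MASTER=localhost", "PORT=11000",
                            "TIMEOUT=2592000", "XDR=TRUE"))
str(opts)
```

With the values coerced up front, 'timeout' reaches socketConnection() as an integer rather than as the character string taken from the command line.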