Good point Kasper, I hadn’t even considered the possibility of that. This has opened an unpleasant box of worms...
I haven’t noticed any issues with MT, but I’ve just switched my DropletUtils C++ code from boost::random::mt19937 to dqrng’s pcg32 to avoid potential problems with overlaps between streams. And to avoid MT seeding issues (32-bit seed for a 19937-bit state). -A > On 8 Jan 2019, at 03:45, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> > wrote: > > To add to Henrik's comments, it is also worthwhile to recognize that > mclapply() does not deliver statistically sound random numbers even within > the apply "loop" unless you use RNGkind("L'Ecuyer-CMRG") which is not set > as default. This is because mclapply will initialize random streams with > different seeds and most RNGs does not guarantee that different seeds leads > to independent streams (but L'Ecuyer does). > > Reproducibility is nice, but so is statistical correctness. > > Best, > Kasper > > > On Mon, Jan 7, 2019 at 6:26 PM Henrik Bengtsson <henrik.bengts...@gmail.com> > wrote: > >> On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.b...@gmail.com> >> wrote: >>> >>> I hope for 1. to have a 'local socket' (i.e., not using ports) >> implementation shortly. >>> >>> I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We >> now have >>> >>>> library(BiocParallel) >>>> set.seed(1); p = bpparam(); rnorm(1) >>> [1] -0.6264538 >>>> set.seed(1); p = bpparam(); rnorm(1) >>> [1] -0.6264538 >>> >>> at the expensive of using the generator when the package is loaded. >>> >>>> set.seed(1); rnorm(1) >>> [1] -0.6264538 >>>> set.seed(1); library(BiocParallel); rnorm(1) >>> [1] 0.1836433 >>> >>> Is that bad? It will be consistent across platforms. >> >> Contrary to resampling methods etc., the RNG for generating random >> ports does _not_ have to be statistically sound. Because of this, you >> should be able to generate a random port while *preserving the RNG >> state*, i.e. undoing .GlobalEnv$.Random.seed (including removing if in >> a fresh R session). I decided to add this approach for >> future::makeClusterPSOCK() >> [ >> https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82 >> ]. >> >> BTW, here's a useful development tool for detecting when the >> .Random.seed has been updated in an interactive session: >> >> addTaskCallback(local({ >> last <- .GlobalEnv$.Random.seed >> function(...) { >> curr <- .GlobalEnv$.Random.seed >> if (!identical(curr, last)) { >> warning(".Random.seed changed", call. = FALSE, immediate. = TRUE) >> last <<- curr >> } >> TRUE >> } >> }), name = "RNG tracker") >> >>> >>> This behavior >>> >>>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1))) >>> [1] 0.9624337 0.8925947 >>>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1))) >>> [1] -0.5703597 0.1102093 >>> >>> seems wrong, but is consistent with mclapply >>> >>>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1))) >>> [1] -0.02704527 0.40721777 >>>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1))) >>> [1] -0.8239765 1.2957928 >> >> FWIW, without having invested much time tracking down why the design >> is what it is, I'm getting concerned about how parallel::mclapply() >> does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in >> use. For example: >> >>> library(parallel) >>> RNGkind("L'Ecuyer-CMRG") >>> unlist(mclapply(1:2, function(i) rnorm(1))) ## mc.set.seed = TRUE >> [1] -0.71727 0.56171 >>> unlist(mclapply(1:2, function(i) rnorm(1))) >> [1] -0.71727 0.56171 >>> unlist(mclapply(1:2, function(i) rnorm(1))) >> [1] -0.71727 0.56171 >> >> I wonder how many resampling/bootstrap studies are bitten by this >> behavior. Same if you use mc.set.seed = FALSE; >> >>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE)) >> [1] -0.8518311 -0.8518311 >>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE)) >> [1] -0.8518311 -0.8518311 >>> unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE)) >> [1] -0.8518311 -0.8518311 >> >> >> Here is how I think about parallel RNG right now: >> >> 1. To achieve fully numerically reproducible RNGs in way that is >> *invariant to the number of workers* (amount of chunking), I think the >> only solution is to pregenerated RNG seeds (using >> parallel::nextRNGStream()) for each individual iteration (element). >> In other words, if a worker will process K elements, then the main R >> process needs to generate K RNG seeds and pass those along to the >> work. I use this approach for future.apply::future_lapply(..., >> future.seed = TRUE/<initial_seed>), which then produce identical RNG >> results regardless of backend and amount of chunking. In the past, I >> think I've seen Martin suggesting something similar as a manual >> approach to some users. >> >> 2. The above approach is obviously expensive, especially when there >> are a large number of elements to iterate over. Because of this I'm >> thinking providing an option to use only one RNG seed per worker >> (which is the common approach used elsewhere) >> [https://github.com/HenrikBengtsson/future.apply/issues/20]. This >> won't be invariant to the number of workers, but it "should" still be >> statistically sound. This approach will give reproducible RNG results >> given the same initial seed and the same amount of chunking. >> >> 3. For algorithms which do not rely on RNG, we can ignore both of the >> above. The problem is that it's not always known to the >> user/developer which methods depend on RNG or not. The above 'RNG >> tracker' helps to identify some, but things might also change over >> time. I believe there's room for automating this in one way or the >> other. For instance, having a way to declare a function being >> dependent on RNG or not could help. Static code inspection could also >> do it, e.g. when an R package is built and it could be part of the R >> CMD checks to validate. >> >> 4. Are there other approaches? >> >> /Henrik >> >>> >>> The documented behavior is to us the RNGseed= argument to *Param, but I >> think it could be made consistent (by default, obey the global random >> number seed on workers) at least on a single machine (where the default >> number of cores is constant). >>> >>> I have not (yet?) changed the default behavior to SerialParam. I guess >> the cost of SerialParam is from the dependent packages that need to be >> loaded >>> >>>> system.time(suppressPackageStartupMessages(library(DelayedArray))) >>> user system elapsed >>> 3.068 0.082 3.150 >>> >>> If fastMNN() makes several calls to bplapply(), it might make sense to >> start the default cluster at the top of the function once >>> >>> if (!isup(bpparam())) { >>> bpstart(bpparam()) >>> on.exit(bpstop(bpparam())) >>> } >>> >>> Martin >>> >>> On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" < >> bioc-devel-boun...@r-project.org on behalf of >> infinite.monkeys.with.keyboa...@gmail.com> wrote: >>> >>> As we know, the default BiocParallel backends are currently set to >> MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to >> some extent because a new user running, say, bplapply() without additional >> arguments or set-up would expect some kind of parallelization. However, >> from a developer’s perspective, I would argue that it makes more sense to >> use SerialParam() by default. >>> >>> 1. It avoids problems with MulticoreParam stalling (especially on >> Macs) when the randomly chosen port is in already use. This used to be a >> major problem, to the point that all my BiocParallel-using functions in >> scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as >> package default would ensure BiocParallel functions run properly in the >> first place; if the code stalls due to switching to MulticoreParam, then >> it’s obvious where the problem lies (and how to fix it). >>> >>> 2. It avoids the alteration of the random seed when the >> MulticoreParam instance is constructed for the first time. >>> >>> library(BiocParallel) # new R session >>> set.seed(100) >>> invisible(bplapply(1:5, identity)) >>> rnorm(1) # 0.1315312 >>> set.seed(100) >>> invisible(bplapply(1:5, identity)) >>> rnorm(1) # -0.5021924 >>> >>> This is because the first bplapply() call calls bpparam(), which >> constructs a MulticoreParam() for the first time; this calls the PRNG to >> choose a random port number. Ensuing random numbers are altered, as seen >> above. To avoid this, I need to define the MulticoreParam() object prior to >> set.seed(), which undermines the utility of a default-defined bpparam(). >>> >>> 3. Job dispatch via SnowParam() is quite slow, which potentially >> makes Windows package builds run slower by default. A particularly bad >> example is that of scran::fastMNN(), which has a few matrix multiplications >> that use DelayedArray:%*%. The %*% is parallelized with the default >> bpparam(), resulting in SNOW parallelization on Windows. This slowed down >> fastMNN()’s examples from 4 seconds (unix) to ~100 seconds (windows). >> Clearly, serial execution is the faster option here. A related problem is >> MulticoreParam()’s tendency to copy the environment, which may result in >> problems from inflated memory consumption. >>> >>> So, can we default to SerialParam() on all platforms? And by this I >> mean the BiocParallel in-built default - I don’t want to have to instruct >> all my users to put a “register(SerialParam())” at the start of their >> analysis scripts. I feel that BiocParallel’s job is to provide downstream >> code with the potential for parallelization. If end-users want actual >> parallelization, they had better be prepared to specify an appropriate >> scheme via *Param() objects. >>> >>> -A >>> >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel