I hope for 1. to have a 'local socket' (i.e., not using ports) implementation 
shortly.

I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now have

> library(BiocParallel)
> set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
> set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538

at the expensive of using the generator when the package is loaded.

> set.seed(1); rnorm(1)
[1] -0.6264538
> set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433

Is that bad? It will be consistent across platforms.

This behavior

> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597  0.1102093

seems wrong, but is consistent with mclapply

> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527  0.40721777
> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765  1.2957928

The documented behavior is to us the RNGseed= argument to *Param, but I think 
it could be made consistent (by default, obey the global random number seed on 
workers) at least on a single machine (where the default number of cores is 
constant).

I have not (yet?) changed the default behavior to SerialParam. I guess the cost 
of SerialParam is from the dependent packages that need to be loaded

> system.time(suppressPackageStartupMessages(library(DelayedArray)))
   user  system elapsed
  3.068   0.082   3.150

If fastMNN() makes several calls to bplapply(), it might make sense to start 
the default cluster at the top of the function once

    if (!isup(bpparam())) {
        bpstart(bpparam())
        on.exit(bpstop(bpparam()))
    }

Martin

On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" 
<bioc-devel-boun...@r-project.org on behalf of 
infinite.monkeys.with.keyboa...@gmail.com> wrote:

    As we know, the default BiocParallel backends are currently set to 
MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to 
some extent because a new user running, say, bplapply() without additional 
arguments or set-up would expect some kind of parallelization. However, from a 
developer’s perspective, I would argue that it makes more sense to use 
SerialParam() by default. 
    
    1. It avoids problems with MulticoreParam stalling (especially on Macs) 
when the randomly chosen port is in already use. This used to be a major 
problem, to the point that all my BiocParallel-using functions in scran passed 
BPPARAM=SerialParam() by default. Setting SerialParam() as package default 
would ensure BiocParallel functions run properly in the first place; if the 
code stalls due to switching to MulticoreParam, then it’s obvious where the 
problem lies (and how to fix it).
    
    2. It avoids the alteration of the random seed when the MulticoreParam 
instance is constructed for the first time. 
    
    library(BiocParallel) # new R session
    set.seed(100)
    invisible(bplapply(1:5, identity))
    rnorm(1) # 0.1315312
    set.seed(100)
    invisible(bplapply(1:5, identity))
    rnorm(1) # -0.5021924
    
    This is because the first bplapply() call calls bpparam(), which constructs 
a MulticoreParam() for the first time; this calls the PRNG to choose a random 
port number. Ensuing random numbers are altered, as seen above. To avoid this, 
I need to define the MulticoreParam() object prior to set.seed(), which 
undermines the utility of a default-defined bpparam().
    
    3. Job dispatch via SnowParam() is quite slow, which potentially makes 
Windows package builds run slower by default. A particularly bad example is 
that of scran::fastMNN(), which has a few matrix multiplications that use 
DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting 
in SNOW parallelization on Windows. This slowed down fastMNN()’s examples from 
4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the 
faster option here. A related problem is MulticoreParam()’s tendency to copy 
the environment, which may result in problems from inflated memory consumption.
    
    So, can we default to SerialParam() on all platforms? And by this I mean 
the BiocParallel in-built default - I don’t want to have to instruct all my 
users to put a “register(SerialParam())” at the start of their analysis 
scripts. I feel that BiocParallel’s job is to provide downstream code with the 
potential for parallelization. If end-users want actual parallelization, they 
had better be prepared to specify an appropriate scheme via *Param() objects. 
    
    -A
    
    
    
    
        [[alternative HTML version deleted]]
    
    _______________________________________________
    Bioc-devel@r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel
    
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to