Re: [R] Random Seed Location

Henrik Bengtsson Sun, 04 Mar 2018 14:55:46 -0800

The following helps identify when .GlobalEnv$.Random.seed has changed:

rng_tracker <- local({
  last <- .GlobalEnv$.Random.seed
  function(...) {
    curr <- .GlobalEnv$.Random.seed
    if (!identical(curr, last)) {
      warning(".Random.seed changed")
      last <<- curr
    }
    TRUE
  }
})


addTaskCallback(rng_tracker, name = "RNG tracker")


EXAMPLE:

> sample.int(1L)
[1] 1
Warning: .Random.seed changed

This will help you find for instance:

## Loading ggplot2 does not affect the RNG seed
> loadNamespace("ggplot2")
<environment: namespace:ggplot2>

## But attaching it does
> library("ggplot2")
Warning: .Random.seed changed

which reveals:

> ggplot2:::.onAttach
function (...)
{
    if (!interactive() || stats::runif(1) > 0.1)
        return()
    tips <- c("Need help? Try the ggplot2 mailing list:
http://groups.google.com/group/ggplot2.";,
        "Find out what's changed in ggplot2 at
http://github.com/tidyverse/ggplot2/releases.";,
        "Use suppressPackageStartupMessages() to eliminate package
startup messages.",
        "Stackoverflow is a great place to get help:
http://stackoverflow.com/tags/ggplot2.";,
        "Need help getting started? Try the cookbook for R:
http://www.cookbook-r.com/Graphs/";,
        "Want to understand how all the pieces fit together? Buy the
ggplot2 book: http://ggplot2.org/book/";)
    tip <- sample(tips, 1)
    packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
}
<environment: namespace:ggplot2>

There are probably many case of this in different R packages.


R WISH:

There could be a

preserveRandomSeed({
  tip <- sample(tips, 1)
})

function in R for these type of random needs where true random
properties are non-critical.  This type of
"draw-a-random-number-and-reset-the-seed" is for instance used in
parallel:::initDefaultClusterOptions() which is called when the
'parallel' package is loaded:

seed <- .GlobalEnv$.Random.seed
ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max
port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1)
if(is.null(seed)) ## there was none, initially
rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
else # reset
assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE)

/Henrik

On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <gwblack...@sbcglobal.net> wrote:
> Thank you, everybody, who replied!  I appreciate your valuable advise!  I 
> will move the location of the set.seed() command to after all packages have 
> been installed and loaded.
>
> Best regards,
> Gary
>
> Sent from my iPad
>
>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert...@gmail.com> wrote:
>>
>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack...@sbcglobal.net>
>> wrote:
>>
>> (Sorry to be a bit slow responding.)
>>
>> You have not supplied a complete example, which would be good in this case 
>> because what you are suggesting could be a serious bug in R or a package. 
>> Serious journals require reproducibility these days. For example, JSS is 
>> very clear on this point.
>>
>> To your question
>> > My question simply is:  should the location of the set.seed command matter,
>> > provided that it is applied before any commands which involve randomness
>> > (such as partitioning)?
>>
>> the answer is no, it should not matter. But the proviso is important.
>>
>> You can determine where things are messing up using something like
>>
>> set.seed(654321)
>> zk <- RNGkind()        # [1] "Mersenne-Twister" "Inversion"
>> zk
>> z <- runif(2)
>> z
>> set.seed(654321)
>>
>> #  install.packages(c('caret', 'ggplot2', 'e1071'))
>> library(caret)
>> all(runif(2)  == z)   # should be true but it is not always
>>
>> set.seed(654321)
>> library(ggplot2)
>> all(runif(2)  == z)   # should be true
>>
>> set.seed(654321)
>> library(e1071)
>> all(runif(2)  == z)   # should be true
>>
>> all(RNGkind() == zk)  # should be true
>>
>> On my computer package caret seems to sometimes, but not always, do 
>> something that advances or changes the RNG. So you will need to set the seed 
>> after that package is loaded if you want reproducibility.
>>
>> As Bill Dunlap points out, parallel can introduce much more complicated 
>> issues. If you are in fact using parallel then we really need a new thread 
>> with a better subject line, and the discussion will get much messier.
>>
>> The short answer is that, yes you should be able to get reproducible results 
>> with parallel computing. If you cannot then you are almost certainly doing 
>> something wrong. To publish you really must have reproducible results.
>>
>> In the example that Bill gave, I think the problem is that set.seed() only 
>> resets the seed in the main thread, the nodes continue to operate with 
>> unreset RNG. To demonstrate this to yourself you can do
>>
>> library(parallel)
>> cl <- parallel::makeCluster(3)
>> parallel::clusterCall(cl, function()set.seed(100))
>> parallel::clusterCall(cl, function()RNGkind())
>> parallel::clusterCall(cl, function()runif(2)) # similar result from all nodes
>>                                               # [1] 0.3077661 0.2576725
>>
>> However, do *NOT* do that in real work. You will be getting the same RNG 
>> stream from each node. If you are using random numbers and parallel you need 
>> to read a lot more, and probably consider a variant of the "L'Ecuyer" 
>> generator or something designed for parallel computing.
>>
>> One special point I will mention because it does not seem to be widely
>> appreciated: the number of nodes affects the random stream, so recording the 
>> number of compute nodes along with the RNG and seed information is important 
>> for reproducible results. This has the unfortunate consequence that an 
>> experiment cannot be reproduced on a larger cluster. (If anyone knows 
>> differently I would very much like to hear.)
>>
>> Paul Gilbert
>>
>>
>> > Hi all,
>> >
>> > For some odd reason when running naïve bayes, k-NN, etc., I get slightly
>> > different results (e.g., error rates, classification probabilities) from
>> > run
>> > to run even though I am using the same random seed.
>> >
>> > Nothing else (input-wise) is changing, but my results are somewhat
>> > different
>> > from run to run.  The only randomness should be in the partitioning, and I
>> > have set the seed before this point.
>> >
>> > My question simply is:  should the location of the set.seed command matter,
>> > provided that it is applied before any commands which involve randomness
>> > (such as partitioning)?
>> >
>> > If you need to see the code, it is below:
>> >
>> > Thank you,
>> > Gary
>> >
>> >
>> > A.      Separate the original (in-sample) data from the new (out-of-sample)
>> > data.  Set a random seed.
>> >
>> >> InvestTech <- as.data.frame(InvestTechRevised)
>> >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>> >> InvestTech <- InvestTech[1:5000, ]
>> >> set.seed(654321)
>> >
>> > B.      Install and load the caret, ggplot2 and e1071 packages.
>> >
>> >> install.packages(“caret”)
>> >> install.packages(“ggplot2”)
>> >> install.packages(“e1071”)
>> >> library(caret)
>> >> library(ggplot2)
>> >> library(e1071)
>> >
>> > C.      Bin the predictor variables with approximately equal counts using
>> > the cut_number function from the ggplot2 package.  We will use 20 bins.
>> >
>> >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>> >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>> >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>> >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>> >
>> > D.      Partition the original (in-sample) data into 60% training and 40%
>> > validation sets.
>> >
>> >> n <- nrow(InvestTech)
>> >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>> >> InvestTechTrain <- InvestTech[train, ]
>> >> InvestTechVal <- InvestTech[-train, ]
>> >
>> > E.      Use the naiveBayes function in the e1071 package to fit the model.
>> >
>> >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
>> >> prob <- predict(model, newdata = InvestTechVal, type = “raw”)
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >
>> > F.      Use the confusionMatrix function in the caret package to output the
>> > confusion matrix.
>> >
>> >> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode =
>> > “everything”, positive = “1”)
>> >> accuracy <- confMtr$overall[1]
>> >> valError <- 1 – accuracy
>> >> confMtr
>> >
>> > G.      Classify the 18 new (out-of-sample) readers using the following
>> > code.
>> >> prob <- predict(model, newdata = outOfSample, type = “raw”)
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >> cbind(pred, prob, outOfSample[, -3])
>> >
>> >
>> >
>>
>>
>> If your computations involve the parallel package then set.seed(seed)
>> may not produce repeatable results.  E.g.,
>>
>> > cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local
>> host
>> > set.seed(100); runif(2)
>> [1] 0.3077661 0.2576725
>> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>>         [,1]     [,2]     [,3]
>> [1,] 101.7779 102.5308 103.3459
>> [2,] 101.8128 102.6114 103.9102
>> >
>> > set.seed(100); runif(2)
>> [1] 0.3077661 0.2576725
>> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>>         [,1]     [,2]     [,3]
>> [1,] 101.1628 102.9643 103.2684
>> [2,] 101.9205 102.6937 103.7907
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Random Seed Location

Reply via email to