The following helps identify when .GlobalEnv$.Random.seed has changed:

rng_tracker <- local({
  last <- .GlobalEnv$.Random.seed
  function(...) {
    curr <- .GlobalEnv$.Random.seed
    if (!identical(curr, last)) {
      warning(".Random.seed changed")
      last <<- curr
    }
    TRUE
  }
})
addTaskCallback(rng_tracker, name = "RNG tracker")

EXAMPLE:

> sample.int(1L)
[1] 1
Warning: .Random.seed changed

This will help you find, for instance:

## Loading ggplot2 does not affect the RNG seed
> loadNamespace("ggplot2")
<environment: namespace:ggplot2>

## But attaching it does
> library("ggplot2")
Warning: .Random.seed changed

which reveals:

> ggplot2:::.onAttach
function (...)
{
    if (!interactive() || stats::runif(1) > 0.1)
        return()
    tips <- c("Need help? Try the ggplot2 mailing list: http://groups.google.com/group/ggplot2.",
        "Find out what's changed in ggplot2 at http://github.com/tidyverse/ggplot2/releases.",
        "Use suppressPackageStartupMessages() to eliminate package startup messages.",
        "Stackoverflow is a great place to get help: http://stackoverflow.com/tags/ggplot2.",
        "Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/",
        "Want to understand how all the pieces fit together? Buy the ggplot2 book: http://ggplot2.org/book/")
    tip <- sample(tips, 1)
    packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
}
<environment: namespace:ggplot2>

There are probably many cases of this in different R packages.

R WISH: There could be a

  preserveRandomSeed({
    tip <- sample(tips, 1)
  })

function in R for this type of random need, where true random properties are
non-critical. This type of "draw-a-random-number-and-reset-the-seed" approach
is for instance used in parallel:::initDefaultClusterOptions(), which is
called when the 'parallel' package is loaded:

  seed <- .GlobalEnv$.Random.seed
  ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max
  port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1)
  if (is.null(seed)) ## there was none, initially
      rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  else ## reset
      assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE)

/Henrik

On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <gwblack...@sbcglobal.net> wrote:
> Thank you, everybody, who replied!
> I appreciate your valuable advice! I will move the location of the
> set.seed() command to after all packages have been installed and loaded.
>
> Best regards,
> Gary
>
> Sent from my iPad
>
>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert...@gmail.com> wrote:
>>
>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack...@sbcglobal.net> wrote:
>>
>> (Sorry to be a bit slow responding.)
>>
>> You have not supplied a complete example, which would be good in this case
>> because what you are suggesting could be a serious bug in R or a package.
>> Serious journals require reproducibility these days. For example, JSS is
>> very clear on this point.
>>
>> To your question
>>
>> > My question simply is: should the location of the set.seed command
>> > matter, provided that it is applied before any commands which involve
>> > randomness (such as partitioning)?
>>
>> the answer is no, it should not matter. But the proviso is important.
>>
>> You can determine where things are messing up using something like
>>
>> set.seed(654321)
>> zk <- RNGkind()  # [1] "Mersenne-Twister" "Inversion"
>> zk
>> z <- runif(2)
>> z
>> set.seed(654321)
>>
>> # install.packages(c('caret', 'ggplot2', 'e1071'))
>> library(caret)
>> all(runif(2) == z)  # should be TRUE but it is not always
>>
>> set.seed(654321)
>> library(ggplot2)
>> all(runif(2) == z)  # should be TRUE
>>
>> set.seed(654321)
>> library(e1071)
>> all(runif(2) == z)  # should be TRUE
>>
>> all(RNGkind() == zk)  # should be TRUE
>>
>> On my computer, package caret seems to sometimes, but not always, do
>> something that advances or changes the RNG. So you will need to set the
>> seed after that package is loaded if you want reproducibility.
>>
>> As Bill Dunlap points out, parallel can introduce much more complicated
>> issues. If you are in fact using parallel then we really need a new thread
>> with a better subject line, and the discussion will get much messier.
>>
>> The short answer is that, yes, you should be able to get reproducible
>> results with parallel computing. If you cannot, then you are almost
>> certainly doing something wrong. To publish you really must have
>> reproducible results.
>>
>> In the example that Bill gave, I think the problem is that set.seed() only
>> resets the seed in the main thread; the nodes continue to operate with an
>> unreset RNG. To demonstrate this to yourself you can do
>>
>> library(parallel)
>> cl <- parallel::makeCluster(3)
>> parallel::clusterCall(cl, function() set.seed(100))
>> parallel::clusterCall(cl, function() RNGkind())
>> parallel::clusterCall(cl, function() runif(2))  # similar result from all nodes
>> # [1] 0.3077661 0.2576725
>>
>> However, do *NOT* do that in real work. You would be getting the same RNG
>> stream from each node. If you are using random numbers and parallel, you
>> need to read a lot more, and probably consider a variant of the "L'Ecuyer"
>> generator or something designed for parallel computing.
>>
>> One special point I will mention because it does not seem to be widely
>> appreciated: the number of nodes affects the random stream, so recording
>> the number of compute nodes along with the RNG and seed information is
>> important for reproducible results. This has the unfortunate consequence
>> that an experiment cannot be reproduced on a larger cluster. (If anyone
>> knows differently, I would very much like to hear.)
>>
>> Paul Gilbert
>>
>>
>> > Hi all,
>> >
>> > For some odd reason when running naïve Bayes, k-NN, etc., I get slightly
>> > different results (e.g., error rates, classification probabilities) from
>> > run to run even though I am using the same random seed.
>> >
>> > Nothing else (input-wise) is changing, but my results are somewhat
>> > different from run to run. The only randomness should be in the
>> > partitioning, and I have set the seed before this point.
>> >
>> > My question simply is: should the location of the set.seed command
>> > matter, provided that it is applied before any commands which involve
>> > randomness (such as partitioning)?
>> >
>> > If you need to see the code, it is below:
>> >
>> > Thank you,
>> > Gary
>> >
>> >
>> > A. Separate the original (in-sample) data from the new (out-of-sample)
>> > data. Set a random seed.
>> >
>> >> InvestTech <- as.data.frame(InvestTechRevised)
>> >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>> >> InvestTech <- InvestTech[1:5000, ]
>> >> set.seed(654321)
>> >
>> > B. Install and load the caret, ggplot2 and e1071 packages.
>> >
>> >> install.packages("caret")
>> >> install.packages("ggplot2")
>> >> install.packages("e1071")
>> >> library(caret)
>> >> library(ggplot2)
>> >> library(e1071)
>> >
>> > C. Bin the predictor variables with approximately equal counts using
>> > the cut_number function from the ggplot2 package. We will use 20 bins.
>> >
>> >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>> >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>> >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>> >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>> >
>> > D. Partition the original (in-sample) data into 60% training and 40%
>> > validation sets.
>> >
>> >> n <- nrow(InvestTech)
>> >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>> >> InvestTechTrain <- InvestTech[train, ]
>> >> InvestTechVal <- InvestTech[-train, ]
>> >
>> > E. Use the naiveBayes function in the e1071 package to fit the model.
>> >
>> >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
>> >> prob <- predict(model, newdata = InvestTechVal, type = "raw")
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >
>> > F. Use the confusionMatrix function in the caret package to output the
>> > confusion matrix.
>> >
>> >> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]),
>> >>                            mode = "everything", positive = "1")
>> >> accuracy <- confMtr$overall[1]
>> >> valError <- 1 - accuracy
>> >> confMtr
>> >
>> > G. Classify the 18 new (out-of-sample) readers using the following
>> > code.
>> >
>> >> prob <- predict(model, newdata = outOfSample, type = "raw")
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >> cbind(pred, prob, outOfSample[, -3])
>>
>>
>> If your computations involve the parallel package then set.seed(seed)
>> may not produce repeatable results. E.g.,
>>
>> > cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local host
>> > set.seed(100); runif(2)
>> [1] 0.3077661 0.2576725
>> > parallel::parSapply(cl, 101:103, function(i) runif(2, i, i+1))
>>          [,1]     [,2]     [,3]
>> [1,] 101.7779 102.5308 103.3459
>> [2,] 101.8128 102.6114 103.9102
>> >
>> > set.seed(100); runif(2)
>> [1] 0.3077661 0.2576725
>> > parallel::parSapply(cl, 101:103, function(i) runif(2, i, i+1))
>>          [,1]     [,2]     [,3]
>> [1,] 101.1628 102.9643 103.2684
>> [2,] 101.9205 102.6937 103.7907
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
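[Editor's note] The preserveRandomSeed() helper wished for earlier in the thread does not exist in base R; its name and semantics are hypothetical. A minimal sketch in plain R, modeled on the save/restore logic of parallel:::initDefaultClusterOptions() quoted above, could look like this:

```r
## Hypothetical helper: evaluate 'expr' (which may draw random numbers),
## then restore .GlobalEnv$.Random.seed to its prior state, so callers
## see no change in the RNG state.  Not part of base R.
preserveRandomSeed <- function(expr) {
  seed <- .GlobalEnv$.Random.seed  # NULL if the RNG is not yet initialized
  on.exit({
    if (is.null(seed)) {
      ## There was no seed initially: remove any seed 'expr' created
      if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
        rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    } else {
      ## Reset to the saved seed
      assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE)
    }
  })
  expr  # lazily evaluated here, possibly advancing the RNG
}

## Usage: the startup tip is random, but the user's RNG state is untouched
set.seed(42)
before <- .Random.seed
tip <- preserveRandomSeed(sample(c("tip A", "tip B", "tip C"), 1))
identical(before, .Random.seed)  # TRUE
```

Because the draw still consumes the main RNG stream before the reset, this is only appropriate where true random properties are non-critical, as noted above.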
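[Editor's note] On the parallel reproducibility point raised above: the parallel package itself provides clusterSetRNGStream(), which distributes independent "L'Ecuyer-CMRG" streams to the nodes. With a fixed iseed and a fixed number of nodes, cluster results become repeatable — a sketch of how this addresses Bill's non-repeatable example:

```r
library(parallel)

cl <- makeCluster(2)

## Give each node its own L'Ecuyer-CMRG substream, derived from iseed
clusterSetRNGStream(cl, iseed = 100)
r1 <- parSapply(cl, 101:102, function(i) runif(2, i, i + 1))

## Resetting the streams reproduces the exact same draws
clusterSetRNGStream(cl, iseed = 100)
r2 <- parSapply(cl, 101:102, function(i) runif(2, i, i + 1))

identical(r1, r2)  # TRUE
stopCluster(cl)
```

Note this does not remove Paul's caveat: the streams (and the task split) depend on the number of nodes, so the cluster size must still be recorded alongside iseed for full reproducibility.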