On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert <pgilbert...@gmail.com> wrote: > Paul > > I think (perhaps incorrectly) of the general problem being that one wants to > run a random experiment, on a single node, or two nodes, or ten nodes, or > any number of nodes, and reliably be able to reproduce the experiment > without concern about how many nodes it runs on when you re-run it. > > From your description I don't have the impression your solution would do > that. Am I misunderstanding? >
Well, I think my approach does that! Each time a function runs, it grabs a pre-specified set of seed values and initializes the R .Random.seed appropriately. Since I take the pre-specified seeds from the L'Ecuyer et al approach (cite below), I believe that means each separate stream is dependably uncorrelated and non-overlapping, both within a particular run and across runs. > A second problem is that you want to use a proven algorithm for generating > the numbers. This is implicitly solved by the above, because you always get > the same result as you do on one node with a well proven RNG. If you > generate a string of seed and then numbers from those, do you have a proven > RNG? > Luckily, I think that part was solved by people other than me: L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An object-oriented random-number package with many long streams and substreams. Operations Research 50 1073–5. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf > Paul > > > On 12-02-17 03:57 PM, Paul Johnson wrote: >> >> I've got another edition of my simulation replication framework. I'm >> attaching 2 R files and pasting in the readme. >> >> I would especially like to know if I'm doing anything that breaks >> .Random.seed or other things that R's parallel uses in the >> environment. >> >> In case you don't want to wrestle with attachments, the same files are >> online in our SVN >> >> >> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ >> >> >> ## Paul E. Johnson CRMDA<paulj...@ku.edu> >> ## Portable Parallel Seeds Project. >> ## 2012-02-18 >> >> Portable Parallel Seeds Project >> >> This is how I'm going to recommend we work with random number seeds in >> simulations. It enhances work that requires runs with random numbers, >> whether runs are in a cluster computing environment or in a single >> workstation. >> >> It is a solution for two separate problems. >> >> Problem 1. I scripted up 1000 R runs and need high quality, >> unique, replicable random streams for each one. Each simulation >> runs separately, but I need to be confident their streams are >> not correlated or overlapping. For replication, I need to be able to >> select any run, say 667, and restart it exactly as it was. >> >> Problem 2. I've written a Parallel MPI (Message Passing Interface) >> routine that launches 1000 runs and I need to assure each has >> a unique, replicatable, random stream. I need to be able to >> select any run, say 667, and restart it exactly as it was. >> >> This project develops one approach to create replicable simulations. >> It blends ideas about seed management from John M. Chambers >> Software for Data Analysis (2008) with ideas from the snowFT >> package by Hana Sevcikova and Tony R. Rossini. >> >> >> Here's my proposal. >> >> 1. Run a preliminary program to generate an array of seeds >> >> run1: seed1.1 seed1.2 seed1.3 >> run2: seed2.1 seed2.2 seed2.3 >> run3: seed3.1 seed3.2 seed3.3 >> ... ... ... >> run1000 seed1000.1 seed1000.2 seed1000.3 >> >> This example provides 3 separate streams of random numbers within each >> run. Because we will use the L'Ecuyer "many separate streams" >> approach, we are confident that there is no correlation or overlap >> between any of the runs. >> >> The projSeeds has to have one row per project, but it is not a huge >> file. I created seeds for 2000 runs of a project that requires 2 seeds >> per run. The saved size of the file 104443kb, which is very small. By >> comparison, a 1400x1050 jpg image would usually be twice that size. >> If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, >> still pretty small. >> >> Because the seeds are saved in a file, we are sure each >> run can be replicated. We just have to teach each program >> how to use the seeds. That is step two. >> >> >> 2. Inside each run, an initialization function runs that loads the >> seeds file and takes the row of seeds that it needs. As the >> simulation progresses, the user can ask for random numbers from the >> separate streams. When we need random draws from a particular stream, >> we set the variable "currentStream" with the function useStream(). >> >> The function initSeedStreams creates several objects in >> the global environment. It sets the integer currentStream, >> as well as two list objects, startSeeds and currentSeeds. >> At the outset of the run, startSeeds and currentSeeds >> are the same thing. When we change the currentStream >> to a different stream, the currentSeeds vector is >> updated to remember where that stream was when we stopped >> drawing numbers from it. >> >> >> Now, for the proof of concept. A working example. >> >> Step 1. Create the Seeds. Review the R program >> >> seedCreator.R >> >> That creates the file "projSeeds.rda". >> >> >> Step 2. Use one row of seeds per run. >> >> Please review "controlledSeeds.R" to see an example usage >> that I've tested on a cluster. >> >> "controlledSeeds.R" can also be run on a single workstation for >> testing purposes. There is a variable "runningInMPI" which determines >> whether the code is supposed to run on the RMPI cluster or just in a >> single workstation. >> >> >> The code for each run of the model begins by loading the >> required libraries and loading the seed file, if it exists, or >> generating a new "projSeed" object if it is not found. >> >> library(parallel) >> RNGkind("L'Ecuyer-CMRG") >> set.seed(234234) >> if (file.exists("projSeeds.rda")) { >> load("projSeeds.rda") >> } else { >> source("seedCreator.R") >> } >> >> ## Suppose the "run" number is: >> run<- 232 >> initSeedStreams(run) >> >> After that, R's random generator functions will draw values >> from the first random random stream that was initialized >> in projSeeds. When each repetition (run) occurs, >> R looks up the right seed for that run, and uses it. >> >> If the user wants to begin drawing observations from the >> second random stream, this command is used: >> >> useStream(2) >> >> If the user has drawn values from stream 1 already, but >> wishes to begin again at the initial point in that stream, >> use this command >> >> useStream(1, origin = TRUE) >> >> >> Question: Why is this approach better for parallel runs? >> >> Answer: After a batch of simulations, we can re-start any >> one of them and repeat it exactly. This builds on the idea >> of the snowFT package, by Hana Sevcikova and A.J. Rossini. >> >> That is different from the default approach of most R parallel >> designs, including R's own parallel, RMPI and snow. >> >> The ordinary way of controlling seeds in R parallel would initialize >> the 50 nodes, and we would lose control over seeds because runs would >> be repeatedly assigned to nodes. The aim here is to make sure that >> each particular run has a known starting point. After a batch of >> 10,000 runs, we can look and say "something funny happened on run >> 1,323" and then we can bring that back to life later, easily. >> >> >> >> Question: Why is this better than the simple old approach of >> setting the seeds within each run with a formula like >> >> set.seed(2345 + 10 * run) >> >> Answer: That does allow replication, but it does not assure >> that each run uses non-overlapping random number streams. It >> offers absolutely no assurance whatsoever that the runs are >> actually non-redundant. >> >> Nevertheless, it is a method that is widely used and recommended >> by some visible HOWTO guides. >> >> >> >> Citations >> >> Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant >> Simple Network of Workstations. R package version 1.2-0. >> http://CRAN.R-project.org/package=snowFT >> >> John M Chambers (2008). SoDA: Functions and Exampels for "Software >> for Data Analysis". R package version 1.0-3. >> >> John M Chambers (2008) Software for Data Analysis. Springer. >> >> >> >> >> >> ______________________________________________ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel