On Fri, 2005-12-30 at 18:37 +0100, justin bem wrote: > Hi, > > I have a difficulty with the bootstrap procedure in boot package. > How can I specify the size of sample at each bootstrap ? > I use > >myboot<-boot(data,boot.fun,R=300) > when I display myboot$t I get a vector with just one value the than > the compute statistic in my data file after reading about bootstrap in > the book MASS I add > >set.seed(101) > But I get the same result. than mean than at each boot the sample > draw is exactly the data file. How can I do resolve this ? > > Sincerly !
Justin, First, when creating a new post, please do not do so by replying to an existing post. This screws up the threading in the list archive, making it difficult for folks to search for your post and replies, when the thread ends up being embedded in another unrelated one. Second, with bootstrapping, the bootstrap samples are done using sampling with replacement from the original sample. Thus, the bootstrap replicates are, by default, the same size as the original sample. You might want to look at the examples in ?sample, specifically the resample() function created there, for some additional insight. Using set.seed() simply allows for the ability to repeat the same sequence of sampling randomizations. It has no effect on the replicate sample size. In terms of what you are seeing with respect to only getting a single statistic value, I suspect that there is a problem in how you have defined your statistic function. Using the example from page 134 in MASS4: library(MASS) library(boot) set.seed(101) gal.boot <- boot(galaxies, function(x, i) median(x[i]), R = 1000) > str(gal.boot$t) num [1:1000, 1] 20930 21492 20196 20179 21184 ... Here 't' is of length 1000, the same as R. The median function is passed the galaxies data as 'x' and a set of sampled indices 'i'. Thus, x[i] is the randomly selected replicate passed to median(). Each replicate has 82 elements (the length of galaxies) and this is repeated 1000 times. Note that by default, the 'statistic' function in boot() is defined to take two arguments, 'data' and 'i'. Since median() does not use these, we create the boot() function call as above. We could have done it as: my.median <- function(data, i) { median(data[i]) } set.seed(101) my.boot <- boot(galaxies, my.median, R = 1000) # Now let's compare the resultant boot objects > all.equal(gal.boot, my.boot) [1] "Component 6: target, current don't match when deparsed" [2] "Component 8: target, current don't match when deparsed" They are identical except for the 'statistic' and 'call' elements, which simply reflects the different statistic function expressions used. The key is that the statistic function (in a routine bootstrap such as these) takes two arguments. It could take more (see below). If you want to get a feel for how each of the 1000 replicates looks, you could use the boot.array() function: boot.array(gal.boot, indices = TRUE) boot.array() with 'indices = TRUE' will return a matrix of values representing each replicate as a row (for 1000 rows in this case) and the _indices_ of the values from galaxies (1:82) that was used in each replicate (row). With 'indices = FALSE' (the default here), boot.array() will give you information pertaining to how frequently each of the 82 elements in galaxies was used in each replicate. Keep in mind that this is sampling WITH replacement, so some values will be used more than once in each replicate. For example: > table(boot.array(gal.boot, indices = FALSE)) 0 1 2 3 4 5 6 7 30027 30317 15134 4998 1248 233 37 6 This shows that the 82 values in galaxies were used anywhere from 0 to 7 times in any given replicate. The sum() of the above is 82000 of course and each replicate does have 82 elements: # Get the sum of each row from the boot.array() call # above and create a table > table(rowSums(boot.array(gal.boot, indices = FALSE))) 82 1000 If you actually wanted to generate the replicates used, you could do something like: matrix(galaxies[boot.array(gal.boot, indices = TRUE)], ncol = 82, byrow = TRUE) which would return a 1000 x 82 matrix with the actual galaxies data as sampled in the 1000 replicates. In terms of defining the sample size in each replicate, there is an example in Davison and Hinkley's book (cited in ?boot) and for which the boot package is supporting software. The example is on page 528 and provides an approach for specifying the sample size for the replicates, if that it what you truly want to do. This example uses the 'city' data: > city u x 1 138 143 2 93 104 3 61 69 4 179 260 5 48 75 6 37 63 7 29 50 8 23 48 9 30 111 10 2 50 city.subset <- function(data, i, n = 10) { # This step takes the sampled indices # and returns the first 'n' of them. # Then subsets 'data' using these, in # this case, using the first 'n' rows of # all of the sampled rows in the city dataset d <- data[i[1:n], ] # Now the statistic is calculated on a # replicate with a sample size of 'n' mean(d[, 2]) / mean(d[, 1]) } # Let's do the boot with 200 replicates # each with a sample size of 5. 'n' is passed as # part of the "..." argument in boot(). city.boot <- boot(city, city.subset, R = 200, n = 5) Note that the 'statistic' function here takes three arguments: data: this will be 'city' i: sampled indices n: replicate sample size As pointed out in D&H, the boot.array() function would need to be adjusted here to work properly: boot.array(city.boot, indices = TRUE)[, 1:5] so that you only return the first 5 values for each replicate. There are important considerations if your replicates are sized less than the original sample size, so I would not do this lightly and would recommend securing a copy of D&H to cover some of this area. HTH, Marc Schwartz ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html