dear Hervé, Thanks again for the quick and useful reply!
I think that the theory behind the block bootstrap [Kunsch (1989), Liu and Singh (1992), Politis and Romano (1994)], needs that the blocks be drawn with replacement (you can get some features twice) and that the blocks can be overlapping. In a hand-waving way, I think, it's "good" for the variance estimation on any statistic of interest that y' may have more or less features than y. I will explore a bit using the solutions you've laid out. Now that I think about it, the start-position based solution that I was thinking about will break if two features in y share the same start position, so that's not good. On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: > That helps. I think I start to understand what you are after. > > See below... > > > On 08/13/2018 06:07 PM, Michael Love wrote: >> >> dear Hervé, >> >> Thanks for the quick reply about directions to take this. >> >> I'm sorry for not providing sufficient detail about the goal of block >> bootstrapping in my initial post. Let me try again. For a moment, let >> me ignore multiple chromosomes/seqs and just focus on a single set of >> IRanges. >> >> The point of the block bootstrap is: Let's say we want to find the >> number of overlaps of x and y, and then assess how surprised we are at >> how large that overlap is. Both of them may have features that tend to >> cluster together along the genome (independently). One method would >> just be to move the features in y around to random start sites, making >> y', say B times, and then calculate each of the B times the number of >> overlaps between x and y'. Or we might make this better by having >> blacklisted sites where the randomly shuffled features in y cannot go. >> >> The block bootstrap is an alternative to randomly moving the start >> sites, where instead we create random data, by taking big "blocks" of >> features in y. Each block is a lot like a View. And the ultimate goal >> is to make B versions of the data y where the features have been >> shuffled around, but by taking blocks, we preserve the clumpiness of >> the features in y. >> >> Let me give some numbers to make this more concrete, so say we're >> making a single block bootstrap sample of a chromosome that is 1000 bp >> long. Here is the original y: >> >> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5) >> >> If I go with my coverage approach, I should extend it all the way to >> the end of the chromosome. Here I lose information if there are >> overlaps of features in y, and I'm thinking of a fix I'll describe >> below. >> >> cov <- c(coverage(y), Rle(rep(0,55))) >> >> I could make one block bootstrap sample of y (this is 1 out of B in >> the ultimate procedure) by taking 10 blocks of width 100. The blocks >> have random start positions from 1 to 901. >> >> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100)) > > > Choosing blocks that can overlap with each others could make y' appear > to have more features than y (by repeating some of the original > features). Also choosing blocks that can leave big gaps in the > chromosome could make y' appear to have less features than y > (by dropping some of the original ranges). Isn't that a problem? > > Have you considered choosing a set of blocks that represent a > partitioning of the chromosome? This would make y' look more like y > by preserving the number of original ranges. > > In other words, if you can assign each range in y to one block and > one block only, then you could generate y' by just shifting the > ranges in y. The exact amount of shifting (positive or negative) > would only depend on the block that the range belongs to. This would > give you an y' that is parallel to y (i.e. same number of ranges > and y'[i] corresponds to y[i]). > > Something like this: > > 1) Define the blocks (must be a partitioning of the chromosome): > > blocks <- successiveIRanges(rep(100, 10)) > > 2) Assign each range in y to the block it belongs to: > > mcols(y)$block <- findOverlaps(y, blocks, select="first") > > 1) and 2) are preliminary steps that you only need to do once, > before you generate B versions of the shuffled data (what you > call y'). The next steps are for generating one version of the > shuffled data so would need to be repeated B times. > > 3) Shuffle the blocks: > > perm <- sample(length(blocks)) > perm > # [1] 6 5 8 3 2 7 1 9 4 10 > > permuted_blocks <- successiveIRanges(width(blocks)[perm]) > permuted_blocks[perm] <- permuted_blocks > > 4) Compute the shift along the chromosome that each block went > thru: > > block_shift <- start(permuted_blocks) - start(blocks) > > 5) Apply this shift to the ranges in y: > > shift(y, block_shift[mcols(y)$block]) > # IRanges object with 12 ranges and 1 metadata column: > # start end width | block > # <integer> <integer> <integer> | <integer> > # [1] 651 655 5 | 1 > # [2] 661 665 5 | 1 > # [3] 671 675 5 | 1 > # [4] 411 415 5 | 2 > # [5] 421 425 5 | 2 > # ... ... ... ... . ... > # [8] 11 15 5 | 6 > # [9] 21 25 5 | 6 > # [10] 921 925 5 | 10 > # [11] 931 935 5 | 10 > # [12] 941 945 5 | 10 > > This code would work with overlapping ranges in y and/or > blocks of variable size. > > Hope this helps, > > H. > >> >> This last line below is a hack to get back to the ranges for a single >> block bootstrap sample of y. It works here, but only because none of >> the features in y were overlapping each other. >> >> Instead of coverage(), if I'd made an Rle where the non-zero elements >> are located at the start of ranges, and the value is the width of the >> range, I think this Views approach would actually work. >> >> y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1], >> width=runLength(y.boot.1)[runValue(y.boot.1) == 1]) >> >> I'm interested in building a function that takes in IRanges and >> outputs these shuffled set of IRanges. >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel