Re: [Bioc-devel] Block bootstrap for GenomicRanges

Hervé Pagès Mon, 13 Aug 2018 20:58:30 -0700

That helps. I think I start to understand what you are after.


See below...

On 08/13/2018 06:07 PM, Michael Love wrote:

dear Hervé,

Thanks for the quick reply about directions to take this.

I'm sorry for not providing sufficient detail about the goal of block
bootstrapping in my initial post. Let me try again. For a moment, let
me ignore multiple chromosomes/seqs and just focus on a single set of
IRanges.

The point of the block bootstrap is: Let's say we want to find the
number of overlaps of x and y, and then assess how surprised we are at
how large that overlap is. Both of them may have features that tend to
cluster together along the genome (independently). One method would
just be to move the features in y around to random start sites, making
y', say B times, and then calculate each of the B times the number of
overlaps between x and y'. Or we might make this better by having
blacklisted sites where the randomly shuffled features in y cannot go.

The block bootstrap is an alternative to randomly moving the start
sites, where instead we create random data, by taking big "blocks" of
features in y. Each block is a lot like a View. And the ultimate goal
is to make B versions of the data y where the features have been
shuffled around, but by taking blocks, we preserve the clumpiness of
the features in y.

Let me give some numbers to make this more concrete, so say we're
making a single block bootstrap sample of a chromosome that is 1000 bp
long. Here is the original y:

y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)

If I go with my coverage approach, I should extend it all the way to
the end of the chromosome. Here I lose information if there are
overlaps of features in y, and I'm thinking of a fix I'll describe
below.

cov <- c(coverage(y), Rle(rep(0,55)))

I could make one block bootstrap sample of y (this is 1 out of B in
the ultimate procedure) by taking 10 blocks of width 100. The blocks
have random start positions from 1 to 901.

y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))


Choosing blocks that can overlap with each others could make y' appear
to have more features than y (by repeating some of the original
features). Also choosing blocks that can leave big gaps in the
chromosome could make y' appear to have less features than y
(by dropping some of the original ranges). Isn't that a problem?

Have you considered choosing a set of blocks that represent a
partitioning of the chromosome? This would make y' look more like y
by preserving the number of original ranges.

In other words, if you can assign each range in y to one block and
one block only, then you could generate y' by just shifting the
ranges in y. The exact amount of shifting (positive or negative)
would only depend on the block that the range belongs to. This would
give you an y' that is parallel to y (i.e. same number of ranges
and y'[i] corresponds to y[i]).

Something like this:

1) Define the blocks (must be a partitioning of the chromosome):

  blocks <- successiveIRanges(rep(100, 10))

2) Assign each range in y to the block it belongs to:

  mcols(y)$block <- findOverlaps(y, blocks, select="first")

1) and 2) are preliminary steps that you only need to do once,
before you generate B versions of the shuffled data (what you
call y'). The next steps are for generating one version of the
shuffled data so would need to be repeated B times.

3) Shuffle the blocks:

  perm <- sample(length(blocks))
  perm
  # [1]  6  5  8  3  2  7  1  9  4 10

  permuted_blocks <- successiveIRanges(width(blocks)[perm])
  permuted_blocks[perm] <- permuted_blocks

4) Compute the shift along the chromosome that each block went
thru:

  block_shift <- start(permuted_blocks) - start(blocks)

5) Apply this shift to the ranges in y:

  shift(y, block_shift[mcols(y)$block])
  # IRanges object with 12 ranges and 1 metadata column:
  #          start       end     width |     block
  #      <integer> <integer> <integer> | <integer>
  #  [1]       651       655         5 |         1
  #  [2]       661       665         5 |         1
  #  [3]       671       675         5 |         1
  #  [4]       411       415         5 |         2
  #  [5]       421       425         5 |         2
  #  ...       ...       ...       ... .       ...
  #  [8]        11        15         5 |         6
  #  [9]        21        25         5 |         6
  # [10]       921       925         5 |        10
  # [11]       931       935         5 |        10
  # [12]       941       945         5 |        10

This code would work with overlapping ranges in y and/or
blocks of variable size.

Hope this helps,

H.


This last line below is a hack to get back to the ranges for a single
block bootstrap sample of y. It works here, but only because none of
the features in y were overlapping each other.

Instead of coverage(), if I'd made an Rle where the non-zero elements
are located at the start of ranges, and the value is the width of the
range, I think this Views approach would actually work.

y.boot.1.rng <- IRanges(start(y.boot.1)[runValue(y.boot.1) == 1],
   width=runLength(y.boot.1)[runValue(y.boot.1) == 1])

I'm interested in building a function that takes in IRanges and
outputs these shuffled set of IRanges.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Block bootstrap for GenomicRanges

Reply via email to