Hi Jim,

On 06/21/2010 04:56 PM, James Bullard wrote:
Hi,

This must be readily doable and I am just missing the obvious -- I am
trying, for a long sequence, to classify each position as coming from a
particular context. Something like running nucleotide table. The things
that I have thought about doing are:

.) Making Views representing my sliding windows, but then I don't see an
obvious thing except nucleotide frequency, but that doesn't really solve
the problem.

Making Views representing a sliding windows on a very long sequence like
e.g. Human chr1 generates a big object: 2 integers per position -> 2GB!
And then there is the problem that looping over the views in R is not an
option. So yes, as Patrick mentioned, there is the
letterFrequencyInSlidingView() function that addresses the particular
case of extracting the nucleotide frequencies for each view (it's
optimized, all written in C, and avoids the cost of actually
generating the Views object) but we don't have a general and
efficient mechanism for applying an arbitrary function to the
sliding views.

However, if one wants to compute the frequency of an arbitrary pattern
along the sliding views, something like this can be done:

  library(Biostrings)

  countPatternInSlidingView <- function(pattern, subject, view.width)
  {
    view.width <- as.integer(view.width)
    if (length(subject) < view.width)
        stop("length(subject) < view.width")
    nviews <- length(subject) - view.width + 1L
    if (nchar(pattern) > view.width)
        return(integer(nviews))
    hits <- matchPattern(pattern, subject, algo="naive-exact")
    NLP <- function(total_length, points)
    {
      idx <- logical(total_length)
      idx[points] <- TRUE
      NLP <- cumsum(idx)
      return(NLP)
    }
    ## 'NLP0' has the length of 'subject'
    NLP0 <- NLP(length(subject), start(hits))
    ## 'd1' is guaranteed to be >= 0
    d1 <- view.width - nchar(pattern)
    NLP1 <- NLP0[-seq_len(d1)]
    length(NLP1) <- nviews
    NLP2 <- c(0L, NLP0)
    length(NLP2) <- nviews
    return(NLP1 - NLP2)
  }

Then:
  > countPatternInSlidingView("AT", DNAString("ATATTGATA"),
                              view.width=4)
  [1] 2 1 1 0 1 1

It's not as fast as letterFrequencyInSlidingView() but is still ok:

  > library(BSgenome.Hsapiens.UCSC.hg18)
  > chr1 <- unmasked(Hsapiens$chr1)
  > system.time(TAACCCfreq <- countPatternInSlidingView("TAACCC",
                                              chr1, view.width=40))
     user  system elapsed
   14.250   5.470  21.099
  > TAACCCfreq[1:20]
   [1] 6 5 6 6 6 6 6 5 6 6 6 6 6 5 6 6 6 6 6 5

Not sure this helps for the problem you have though...


.) Using nucleotideFrequencyAt, however, I am not sure that this function
does what I want. Since it takes an XStringSet type object rather than a
DNAString type object.

If you've defined your sliding views on your sequence e.g.:

  > v <- Views(DNAString("ACCCGTCGTGTTGAAT"), start=1:13, width=4)
  > v
    Views on a 16-letter DNAString subject
  subject: ACCCGTCGTGTTGAAT
  views:
       start end width
   [1]     1   4     4 [ACCC]
   [2]     2   5     4 [CCCG]
   [3]     3   6     4 [CCGT]
   [4]     4   7     4 [CGTC]
   [5]     5   8     4 [GTCG]
   [6]     6   9     4 [TCGT]
   [7]     7  10     4 [CGTG]
   [8]     8  11     4 [GTGT]
   [9]     9  12     4 [TGTT]
  [10]    10  13     4 [GTTG]
  [11]    11  14     4 [TTGA]
  [12]    12  15     4 [TGAA]
  [13]    13  16     4 [GAAT]

Then you can call nucleotideFrequencyAt() on it:

  > nucleotideFrequencyAt(v, c(1,4))
    A C G T
  A 0 1 0 0
  C 0 1 2 1
  G 0 0 2 2
  T 2 0 0 2

What you get here are the frequencies for the dinucleotides
obtained by picking the 1st and last nucleotides of each
view. Again, I'm not sure this is what you want...

Can you be more precise?

Thanks,
H.


thanks, jim

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to