A small correction to the algorithm I used for computing the
scores. In my previous email I suggested this:

  tnf <- trinucleotideFrequency2(dict0)
  scores <- rowSums(tnf * tnf)

but I think it would be more accurate and closer to the DUST
approach to start penalizing a read only when it contains triplets
with frequency >= 2, i.e. the first occurrence of a triplet in a
read should not count. This leads to the following code:

  tnf2 <- tnf - 1L
  tnf2[tnf2 < 0] <- 0L
  scores <- rowSums(tnf2 * tnf2)

  > summary(scores)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.00    8.00   12.00   24.59   19.00 1024.00

Now, whatever the length of the reads, they have a chance
to obtain a score of 0, which happens when all the triplets in
a read are distinct. This new score is more meaningful IMO.

H.


Hervé Pagès wrote:
Hi Cei,

Here is a suggestion for question 1).

DUST seems to have been designed to mask regions in long DNA sequences.
It counts tri-nucleotide frequencies in a 64-nucleotide sliding window,
so, strictly speaking, the algorithm doesn't work anymore with short reads.
In addition, your problem seems simpler to me because, if I understand
correctly, you don't want to mask regions within the reads, but just
want to keep or drop each entire read based on its "complexity".
The notion of "complexity" as defined by DUST could be reused though, by
applying trinucleotideFrequency() to the DNAStringSet object containing
the reads, replacing each coefficient in the resulting matrix by its
square, applying rowSums() to it, and finally using the resulting score
to filter the reads (a high score means low complexity):

  library(ShortRead)
  rfq <- readFastq("some_path", pattern="s_1_sequence.txt")
  dict0 <- sread(rfq)

  tnf <- trinucleotideFrequency(dict0)
  scores <- rowSums(tnf * tnf)

  table(scores)  # choose a cut-off value
  clean_dict <- dict0[scores < cut_off]
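
Instead of eyeballing table(scores), one could also (just a suggestion)
look at the upper tail of the score distribution to help pick the
cut-off:

  quantile(scores, c(0.90, 0.95, 0.99))  # candidate cut-off values
  hist(log2(scores + 1))                 # visualize the score distribution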

There is one technical problem though: the top-level loop in
trinucleotideFrequency() is still written in R so that won't work
for millions of reads. The temporary workaround is to implement
a vcountPDict-based version of trinucleotideFrequency():

  trinucleotideFrequency2 <- function(x)
  {
    ## Count all 64 possible triplets in each read with a single
    ## vcountPDict() call, then transpose so that rows correspond to
    ## reads and columns to triplets (as with trinucleotideFrequency()).
    all_triplets <- DNAStringSet(mkAllStrings(c("A", "C", "G", "T"), 3))
    pd <- PDict(all_triplets)
    t(vcountPDict(pd, x))
  }

trinucleotideFrequency2() takes about 20 seconds on a DNAStringSet
object of 4 million 35-mers.
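
Since you are already using srFilter(), the same score could also be
wrapped into a filter. This is just a sketch (the name of the filter is
mine, and 'cut_off' still has to be chosen by looking at the score
distribution as above):

  ## sketch: wrap the complexity score as an srFilter
  lowComplexityFilter <- srFilter(function(x) {
      tnf <- trinucleotideFrequency2(sread(x))
      rowSums(tnf * tnf) < cut_off
  }, name="low complexity")
  clean_rfq <- rfq[lowComplexityFilter(rfq)]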

I calculated the scores for a set of 4.47 million unaligned
35-mers from a single lane of a real Solexa experiment and got:

  > summary(scores)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    22.00   49.00   55.00   66.95   63.00 1089.00

  > dict0[scores >= 200]
    A DNAStringSet instance of length 63331
          width seq
      [1]    35 AAAAAACCAAAACCAAACAAATAAAAACCCCAAAT
      [2]    35 GATAGATAGATAGATAGATAGATAGGGTAGATAGG
      [3]    35 GTGTGTGTGTGTGTGTGTGTATGTGTGTGTGCGTT
      [4]    35 GGTAGGTAGGTAGGTCAGGTAGGTAGGTAGGTAGG
      [5]    35 TGGTTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
      [6]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
      [7]    35 GTGTGTCTGTGTGTCTGTGTGTCTGTGTGTGTCTG
      [8]    35 GTATGTGTGCTTTGTGTGTGTGGTGTGTGTGTGGT
      [9]    35 GAGAGTGTGTGTGTGAGAGAGAGTGTGTGTGTGAG
      ...   ... ...
  [63323]    35 AAAAAAAAAAAAAAAAAAAAACGAAGAAAAAAAAG
  [63324]    35 AAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAACA
  [63325]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  [63326]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  [63327]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  [63328]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  [63329]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  [63330]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  [63331]    35 AAAAAAAGAAAAAAAAAAAAAAAAGAGAAAAAAAA

Note that 1089 is the score obtained by the poly-As, poly-Cs,
etc. Since a 35-mer contains 35 - 3 + 1 = 33 overlapping triplets,
1089 = 33^2 is the highest possible score for a 35-mer.

H.


Cei Abreu-Goodger wrote:
Hi all,

I've been playing around with some Solexa small-RNA reads using ShortRead and Biostrings. I've used the 'trimLRPatterns' function to remove adapter sequence, and I've been trying to remove low-complexity sequences with 'srFilter'. I would first really like to congratulate all the people involved on the great work. There are two situations in which I would be grateful for some suggestions, though:

1) I have many "low-complexity" reads. Some are simply polyA, polyC, etc. But some others are runs of "ATATAT" or "CACACACA", etc. Previously I would have used "dust" on the command line to filter out this kind of read in a fasta file. Any ideas on how to achieve similar functionality in the ShortRead world?

2) For some reads I may have a "N-rich" patch inside the read, for example:
AATAAAGTGCTTACAGTGNNNNTNNATNCAATACCG

I would ideally like to trim off everything starting at the "N-rich" part. I was trying to implement something with 'vmatchPattern', but if I allow for mismatches (for a more flexible search), I also get hits starting before the run of Ns.

Many thanks,

Cei



sessionInfo()

R version 2.9.0 Under development (unstable) (2009-02-13 r47919)
i386-apple-darwin9.6.0

locale:
C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] ShortRead_1.1.39 lattice_0.17-20 BSgenome_1.11.9 Biostrings_2.11.28
[5] IRanges_1.1.38     Biobase_2.3.10

loaded via a namespace (and not attached):
[1] Matrix_0.999375-20 grid_2.9.0




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
