Re: [Bioc-sig-seq] adapter removal

Patrick Aboyoun Sat, 17 Jan 2009 19:26:01 -0800

Joe,

I have been making some modifications to trimLRPatterns both today and in recent days, so you may need to get the latest version of Biostrings directly from svn rather than using biocLite from within R. Once you have a sufficiently recent version, the key is in the construction of the max.Rmismatch argument. Below are some examples that achieve the result you are looking for. The man page for trimLRPatterns has a detailed description of the various types of input that the max.Rmismatch argument accepts.

> suppressMessages(library(Biostrings))
> Rpattern <- "CTGTAGGCACCA"
> subjectSet <-
+   DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
+                  "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))

> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
+                max.Rmismatch = rep(2, 12))
  A DNAStringSet instance of length 2
    width seq
[1]    22 GCTGGAACCCAGGGTGTTGTAC
[2]    24 GTAAGACCATACTTGGCCGAATGC

> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
+                max.Rmismatch = 0.2)
  A DNAStringSet instance of length 2
    width seq
[1]    22 GCTGGAACCCAGGGTGTTGTAC
[2]    24 GTAAGACCATACTTGGCCGAATGC
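
To make the sliding behaviour concrete, here is a rough base-R sketch of the idea. It is not the Biostrings code: the function name trim_right_sketch, the max_rate argument, and the rule of allowing floor(max_rate * k) mismatches for an overlap of k letters are all assumptions made for illustration. See the trimLRPatterns man page for how max.Rmismatch is actually interpreted.

```r
# Illustrative base-R sketch only; NOT the Biostrings implementation.
# trim_right_sketch() and max_rate are hypothetical names.
trim_right_sketch <- function(subject, pattern, max_rate = 0) {
  n <- nchar(subject)
  if (n == 0) return(subject)
  # Try the longest possible overlap first, then let the pattern
  # hang off the right end of the read one more letter at a time.
  for (k in seq(min(nchar(pattern), n), 1)) {
    tail_chars <- strsplit(substring(subject, n - k + 1, n), "")[[1]]
    head_chars <- strsplit(substring(pattern, 1, k), "")[[1]]
    mismatches <- sum(tail_chars != head_chars)
    # Assumed rule: allow floor(max_rate * k) mismatches for overlap k.
    if (mismatches <= floor(max_rate * k)) {
      return(substring(subject, 1, n - k))
    }
  }
  subject  # no acceptable overlap found; leave the read untouched
}

reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")
trimmed <- vapply(reads, trim_right_sketch, character(1),
                  pattern = "CTGTAGGCACCA", max_rate = 0.2,
                  USE.NAMES = FALSE)
nchar(trimmed)  # 22 24
```

The first read contains the full 12-letter adapter, so the longest overlap is accepted immediately. The second read ends after only the first 10 letters of the adapter, so the 12- and 11-letter overlaps fail and the 10-letter overlap matches exactly.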

> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.11.25 IRanges_1.1.34

loaded via a namespace (and not attached):
[1] grid_2.9.0         lattice_0.17-20    Matrix_0.999375-17


Patrick


Quoting joseph franklin <[email protected]>:

Patrick,

This adapter tool looks extremely useful for my purposes: removing
adapters from smRNA reads to estimate the short template lengths.
Forgive me if the answer to this is obvious, but everything seems to
work with trimLRPatterns, except that it doesn't seem to allow the
Rpattern or Lpattern to slide along the sequence (at least using the
default settings--see below).  Rather it looks only for exact matches,
that leave no overhang.  Thus:

Rpattern <- "CTGTAGGCACCA"


trims:

 [6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA

nicely, to:

 [6]    22 GCTGGAACCCAGGGTGTTGTAC


but a sequence in which the Rpattern overhangs the end of the read (here by ~2 nt):

[90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC

is not trimmed at all:

[90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC

What can I do to allow for flexibility at the overhanging end?


Again, thanks very much.
Joe


On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:

I just checked in a trimLRPatterns function to the Bioconductor svn
repository for BioC 2.4. Its signature is

trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
               max.Lmismatch = 0, max.Rmismatch = 0,
               with.Lindels = FALSE, with.Rindels = FALSE,
               Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)

As you can infer from the arguments, this function allows the user to
set the # of mismatches (if with.*indels = FALSE) / edit distance (if
with.*indels = TRUE) for the left and right flanking "patterns". It
also allows for IUPAC ambiguity letters in these flanking regions if
*fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the trimmed
strings. When ranges = TRUE, it returns the ranges that you can use to
trim the strings. Here are some examples:

  Lpattern <- "TTCTGCTTG"
  Rpattern <- "GATCGGAAG"
  subject <- DNAString("TTCTGCTTGACGTGATCGGA")
  subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG"))
  trimLRPatterns(Lpattern = Lpattern, subject = subject)

11-letter "DNAString" instance
seq: ACGTGATCGGA

  trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject = subjectSet)
A DNAStringSet instance of length 2
  width seq
[1]    18 TGCTTGACGGCAGATCGG
[2]     0

  trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern, subject = subjectSet,
                 ranges = TRUE)
IRanges object:
  start end width
1     1  18    18
2    10   9     0
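
As a small illustration of how such ranges can be used to trim, the sketch below applies the start/end values printed above to plain character strings with base R's substring(). The values are copied by hand from that output rather than computed; with a real IRanges result one would presumably pull them out with the start() and end() accessors instead.

```r
# Start/end values copied by hand from the IRanges output shown above.
subjects <- c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG")
starts <- c(1L, 10L)
ends   <- c(18L, 9L)

# substring() returns "" when end < start, which matches the
# zero-width range reported for the second sequence.
trimmed <- substring(subjects, starts, ends)
nchar(trimmed)  # 18 0
```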

This functionality will be available on bioconductor.org (and
downloadable via biocLite) in the next day or so, but you can also grab
Biostrings from svn directly if you need it sooner. It will also feed
its way into Biostrings documentation and training material before the
next release of Bioconductor in May.


Patrick



Patrick Aboyoun wrote:

David,
Following up on Martin's comments, I am putting the finishing touches on a function called trimLRPatterns for the Biostrings package. Its purpose is to trim left and/or right flanking patterns from sequences, so it can strip 5' and/or 3' adapters from your reads. The signature for this function is

trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject, max.Lnedit=0, max.Rnedit=0,
               with.Lindels=FALSE, with.Rindels=FALSE, Lfixed=TRUE, Rfixed=TRUE,
               rangesOnly = FALSE)

I will be checking this function into the BioC 2.4 code line, which requires using R-devel, sometime today or tomorrow. I will send out an e-mail to this group when I check it in and show a simple example of its usage. I talked with Martin and he will wrap this functionality in the ShortRead layer so you don't have to leave the ShortRead class system when removing adapters from your reads.

Cheers,
Patrick


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
