Re: [Bioc-sig-seq] adapter removal

Victor Ruotti Fri, 30 Jan 2009 16:02:12 -0800

Maybe not a topic directly related to adapter removal... However, do you guys have a simple way to split the fastq files once loaded into Biostrings? Say you load a whole lane and now want to split 12M reads and their qualities into multiple fastq files for processing with Maq, gmap, etc.? Perhaps this should be done outside of R? Just wanted to see what you think...
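
For reference, the kind of split asked about here can be sketched in base R without loading the whole lane into memory. This is only a rough illustration, not something Biostrings or ShortRead provides: the function name split_fastq and its arguments are made up for this sketch, and it assumes strict 4-line FASTQ records (no wrapped sequence or quality lines).

```r
# Hypothetical helper: split a FASTQ file into pieces holding at most
# `reads_per_file` records each. Assumes every record is exactly 4 lines.
split_fastq <- function(path, reads_per_file, prefix = "chunk") {
  con <- file(path, open = "r")
  on.exit(close(con))
  out_files <- character(0)
  repeat {
    # Read one chunk of records from the open connection.
    lines <- readLines(con, n = 4L * reads_per_file)
    if (length(lines) == 0L) break
    if (length(lines) %% 4L != 0L) stop("truncated FASTQ record")
    out <- sprintf("%s_%03d.fastq", prefix, length(out_files) + 1L)
    writeLines(lines, out)
    out_files <- c(out_files, out)
  }
  out_files
}

# Usage on a tiny made-up file of three reads, two reads per output file.
demo <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "IIII",
             "@r2", "TTGA", "+", "IIII",
             "@r3", "GGCC", "+", "IIII"), demo)
chunks <- split_fastq(demo, reads_per_file = 2,
                      prefix = file.path(tempdir(), "chunk"))
```

Because the reads and their qualities stay together inside each 4-line record, the pieces can be handed to an external aligner independently.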

See below
...

From the Grid Engine Life Science SIG


Can you share the program you use to split the reads?

I'm in the process of writing a fasta/fastq/seq/prb splitter. I use Perl and thought about starting a BioPerl module for next-gen stuff. They already have a bunch of modules to deal with qualities, so I was thinking that adding a method to split fastq files for maq/gmap processing would be something good to have in BioPerl. Great work has also been done with Biostrings, and maybe Martin can comment on this.

As simple as it might sound, it would be good to have BioPerl and maybe Biostrings set up for this.

Will try to post this in the biostrings forum as well.
Any thoughts/interest on this?

Victor


On Jan 18, 2009, at 7:59 PM, Patrick Aboyoun wrote:

Kasper,
Yes, but there is a delay of between 12 and 36 hours between an svn checkin and a package being available at bioconductor.org.
Patrick


Quoting Kasper Daniel Hansen <[email protected]>:
Shouldn't biocLite pick up recent additions to the subversion
repository, provided that you are using R-devel and you install using
pkgType = "source"?

Kasper

On Jan 17, 2009, at 19:24 , Patrick Aboyoun wrote:
Joe,
I have been making some modifications to trimLRPatterns both today and in recent days, so you may need to get the latest version of Biostrings directly from svn rather than using biocLite from within R. Once you have a sufficiently recent version, the key is in the construction of the max.Rmismatch argument. Below are some examples that achieve the result you are looking for. The man page for trimLRPatterns has a detailed description of the various types of input that are accepted by the max.Rmismatch argument.
> suppressMessages(library(Biostrings))
> Rpattern <- "CTGTAGGCACCA"
> subjectSet <-
+ DNAStringSet(c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
+                "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC"))
> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
+                max.Rmismatch = rep(2, 12))
  A DNAStringSet instance of length 2
    width seq
[1]    22 GCTGGAACCCAGGGTGTTGTAC
[2]    24 GTAAGACCATACTTGGCCGAATGC
> trimLRPatterns(Rpattern = Rpattern, subject = subjectSet,
+                max.Rmismatch = 0.2)
  A DNAStringSet instance of length 2
    width seq
[1]    22 GCTGGAACCCAGGGTGTTGTAC
[2]    24 GTAAGACCATACTTGGCCGAATGC
> sessionInfo()
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-01-15 r47619)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.11.25 IRanges_1.1.34

loaded via a namespace (and not attached):
[1] grid_2.9.0         lattice_0.17-20    Matrix_0.999375-17
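
To make the idea of a per-position mismatch allowance concrete, here is a base R sketch. It is not the Biostrings implementation: trim_right_adapter is a hypothetical helper, and both trying the longest overlap first and expanding a rate with as.integer(i * rate) are assumptions made for this illustration. max_mismatch[i] is the number of mismatches tolerated when the first i letters of the adapter are aligned with the last i letters of the read.

```r
# Hypothetical helper, not part of Biostrings: trim a 3' adapter that may
# run off the end of the read.
trim_right_adapter <- function(read, adapter, max_mismatch) {
  n_read <- nchar(read)
  n_max <- min(nchar(adapter), n_read, length(max_mismatch))
  # Try the longest overlap first (an assumption made for this sketch).
  for (i in rev(seq_len(n_max))) {
    tail_i <- strsplit(substring(read, n_read - i + 1L, n_read), "")[[1]]
    head_i <- strsplit(substring(adapter, 1L, i), "")[[1]]
    if (sum(tail_i != head_i) <= max_mismatch[i]) {
      return(substring(read, 1L, n_read - i))
    }
  }
  read  # no acceptable overlap found: return the read unchanged
}

adapter <- "CTGTAGGCACCA"
reads <- c("GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA",
           "GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC")

# A mismatch rate of 0.2 expanded into one allowance per overlap length.
rate_allowance <- as.integer(seq_len(nchar(adapter)) * 0.2)
trimmed <- vapply(reads, trim_right_adapter, character(1),
                  adapter = adapter, max_mismatch = rate_allowance,
                  USE.NAMES = FALSE)
```

For these two reads the sketch leaves the same 22- and 24-letter sequences as the output above. Note that in this sketch a flat allowance such as rep(2, 12) would accept any overlap of one or two letters, which is one reason a rate that starts at zero is convenient.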


Patrick


Quoting joseph franklin <[email protected]>:
Patrick,

This adapter tool looks extremely useful for my purposes: removing
adapters from smRNA reads to estimate the short template lengths.
Forgive me if the answer to this is obvious, but everything seems to
work with trimLRPatterns, except that it doesn't seem to allow the
Rpattern or Lpattern to slide along the sequence (at least using the
default settings--see below). Rather it looks only for exact matches
that leave no overhang.  Thus:
Rpattern <- "CTGTAGGCACCA"
trims:

[6]    34 GCTGGAACCCAGGGTGTTGTACCTGTAGGCACCA

nicely, to:

[6]    22 GCTGGAACCCAGGGTGTTGTAC


but a sequence resulting in an Rpattern overhang (here ~2 nt):

[90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC

is not trimmed at all:

[90]    34 GTAAGACCATACTTGGCCGAATGCCTGTAGGCAC

What can I do to allow for flexibility at the overhanging end?


Again, thanks very much.
Joe


On 14 Jan 2009, at 19:17, Patrick Aboyoun wrote:

I just checked in a trimLRPatterns function to the Bioconductor svn
repository for BioC 2.4. Its signature is

trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
            max.Lmismatch = 0, max.Rmismatch = 0,
            with.Lindels = FALSE, with.Rindels = FALSE,
            Lfixed = TRUE, Rfixed = TRUE, ranges = FALSE)
As you can infer from the arguments, this function allows the user to
set the # of mismatches (if with.*indels = FALSE) / edit distance (if
with.*indels = TRUE) for the left and right flanking "patterns". It
also allows for IUPAC ambiguity letters in these flanking regions if
*fixed = FALSE. When ranges = FALSE, trimLRPatterns returns the trimmed
strings. When ranges = TRUE, it returns the ranges that you can use to
trim the strings. Here are some examples:
> Lpattern <- "TTCTGCTTG"
> Rpattern <- "GATCGGAAG"
> subject <- DNAString("TTCTGCTTGACGTGATCGGA")
> subjectSet <- DNAStringSet(c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG"))
> trimLRPatterns(Lpattern = Lpattern, subject = subject)
  11-letter "DNAString" instance
seq: ACGTGATCGGA
> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
+                subject = subjectSet)
  A DNAStringSet instance of length 2
    width seq
[1]    18 TGCTTGACGGCAGATCGG
[2]     0
> trimLRPatterns(Lpattern = Lpattern, Rpattern = Rpattern,
+                subject = subjectSet, ranges = TRUE)
IRanges object:
  start end width
1     1  18    18
2    10   9     0
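
As a rough illustration of how such ranges can be used to trim the strings, the start/end values shown above can be applied with base R's substring(). This sketch copies the numbers from the output by hand into plain vectors rather than working on the IRanges object itself.

```r
# Reads and ranges copied from the example output above.
reads <- c("TGCTTGACGGCAGATCGG", "TTCTGCTTGGATCGGAAG")
starts <- c(1L, 10L)
ends <- c(18L, 9L)

# substring() returns "" when start > end, matching the zero-width range.
trimmed <- substring(reads, starts, ends)
widths <- ends - starts + 1L
```

The second read consists of nothing but the two flanking patterns, so its range has width 0 and the trimmed string is empty.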

This functionality will be available on bioconductor.org (and
downloadable via biocLite) in the next day or so, but you can also grab
Biostrings from svn directly if you need it sooner. It will also feed
its way into Biostrings documentation and training material before the
next release of Bioconductor in May.


Patrick



Patrick Aboyoun wrote:
David,
Following up on Martin's comments, I am putting the finishing touches on a function called trimLRPatterns for the Biostrings package. Its purpose is to trim left and/or right flanking patterns from sequences, so it can strip 5' and/or 3' adapters from your reads. The signature for this function is
trimLRPatterns(Lpattern = NULL, Rpattern = NULL, subject,
            max.Lnedit = 0, max.Rnedit = 0,
            with.Lindels = FALSE, with.Rindels = FALSE,
            Lfixed = TRUE, Rfixed = TRUE, rangesOnly = FALSE)
I will be checking this function into the BioC 2.4 code line, which requires using R-devel, sometime today or tomorrow. I will send out an e-mail to this group when I check it in and show a simple example of its usage. I talked with Martin and he will wrap this functionality in the ShortRead layer so you don't have to leave the ShortRead class system when removing adapters from your reads.
Cheers,
Patrick
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing