Hi Dave -- "David A.G" <[email protected]> writes:
> Hi Martin and list, > going back to the adapter topic, I tried the srdistance() function and it > works really quick! yeah it's great; Herve Pages & Patrick Aboyoun deserve credit for this. > I have a question regarding the edit distance. After having used tables() I > find that the second top element is CTGGACTTTGTAGGATACCCTCGCTTTCCTGCTC with > 890 occurrences. The edit distance for this sequence is of course 0. But if I > try only part of it as an adapter, CTGGACTTTGTAGGATA, the lowest edit distance > is 17 and there are 14758 reads with this sequence. > Would you just use 17 as the distance cutoff? If I wanted to find just those 890 occurrences in an object 'fq' with the adapter at the start, I'd use srdistance on a DNAStringSet created from just those nucleotides, along the lines of > d <- srdistance(DNAStringSet(sread(fq), 1, 17), "CTGGACTTTGTAGGATA") > fg[d != 0] It's not really clear that the adapters will always be exactly at the beginning, etc., so this might just be a starting point. Creating a new DNAStringSet is not particularly memory- or CPU-intensive, so creating it 'on the fly' as in the above example should be reasonable. I think Patrick is working on a one-stop-shopping adapter removal function in Biostrings. > Also,is it possible to merge two ShortReadQ objects, or, having subset an > ShortReadQ object in two by indexing, how do I put them back together? The ability to 'append' two ShortReadQ objects was added just yesterday > append(fq, fq) It should be available in the 'devel' branch using biocLite after about 1pm Seattle time, all being well; look for ShortRead version >= 1.1.33. It's possible to do this with the released version of ShortRead, and I can provide off-line hints if you'd like, but I'd recommend working with the development version of R and these packages. I'm curious to know the use case here; the need to append hasn't come up in my own work. Martin > Cheers, > Dave >> To: [email protected] >> CC: [email protected]; [email protected] >> Subject: Re: [Bioc-sig-seq] adapter removal >> From: [email protected] >> Date: Fri, 9 Jan 2009 19:19:18 -0800 >> >> "David A.G" <[email protected]> writes: >> >> > Jim, thanks for pointing that out, I hadn“t faced that issue yet. The > "reads" object I created only had the sequences and not the qualities, and > after editing the adapters it is interesting to only have the quality calls > for the remaining bases. >> > >> > Tyler and Martin, thanks for the information on writing back to fastq. >> >> There is a now an initial version of writeFastq in the 'devel' version >> of ShortRead. It is memory and time efficient; try >> >> > example(writeFastq) >> > ?writeFastq >> >> please let me know if there are problems, or other missing parts of >> ShortRead. >> >> Martin >> >> > Dave >> > >> >> Date: Thu, 8 Jan 2009 15:12:01 -0800 >> >> From: [email protected] >> >> To: [email protected] >> >> Subject: Re: [Bioc-sig-seq] adapter removal >> >> >> >> Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very > slow... >> >> >> >> exportFastaQ <- function(shortReadQObject, fileName) { >> >> ## dump ShortReadQ object data to FASTQ format >> >> entries <- length(shortReadQObject) >> >> i <- 1 >> >> while (i <= entries){ >> >> cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="", >> >> file=fileName, append=TRUE) >> >> cat("\n", file=fileName, append=TRUE) >> >> cat(as.character(sread(shortReadQObject)[i]), sep="", >> >> file=fileName, append=TRUE) >> >> cat("\n", file=fileName, append=TRUE) >> >> cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="", >> >> file=fileName, append=TRUE) >> >> cat("\n", file=fileName, append=TRUE) >> >> cat(as.character(quality(shortReadQObject)[[i]]), sep="", >> >> file=fileName, append=TRUE) >> >> cat("\n", file=fileName, append=TRUE) >> >> i <- i + 1 >> >> } >> >> } >> >> >> >> Sincerely, >> >> Tyler William H Backman >> >> Bioinformatics Analyst >> >> Institute for Integrative Genome Biology (IIGB) >> >> Department of Botany and Plant Sciences >> >> E-mail: [email protected] >> >> 1007 Noel T. Keen Hall >> >> University of California >> >> Riverside, CA 92521 >> >> >> >> > Hi Jim -- >> >> > >> >> > "James W. MacDonald" <[email protected]> writes: >> >> > >> >> >> As a follow-up question, once you have removed the adapter sequences >> >> >> is it then possible to write the ShortReadQ object back out in a fastq >> >> >> format for aligning with e.g., bowtie? I know I can use >> >> >> write.XstringSet() to output in fasta format, but I would like to feed >> >> >> bowtie the quality scores as well. >> >> > >> >> > Actually a 'writeFastq' is on my 'do this week' list. In the mean time >> >> > my best (untested) suggestion is along the lines of >> >> > >> >> > n <- length(fq) >> >> > lines <- character(n) >> >> > lines[(seq_len(n)) * 4 - 3] <- >> >> > paste("@", as.character(id(fq)), sep="") >> >> > lines[(seq_len(n)) * 4 - 2] <- >> >> > as.character(sread(rfq1)) >> >> > lines[(seq_len(n)) * 4 - 1] <- >> >> > paste("+", as.character(id(fq)), sep="") >> >> > lines[(seq_len(n)) * 4] <- >> >> > as.character(quality(quality(fq))) >> >> > writeLines(lines, "my.fastq") >> >> > >> >> > This will be unnecessarily memory intensive. >> >> > >> >> >> Or is the speed reasonably comparable to use the PDict functions in >> >> >> Biostrings to do the aligning? >> >> > >> >> > Bowtie is very fast. One potential 'gotcha' is the orientation of >> >> > reads throughout the workflow -- Eland _sequence.txt files are the >> >> > reads as they come from the machine; MAQ (and I think Bowtie) aligned >> >> > files report reads that align to the '-' strand in their reverse >> >> > complement sequence, with qualities reversed to match the reverse >> >> > compelement read, and with the alignment position recorded at the >> >> > 'left-most' (rather than 5', say) end. Not that it will be a problem >> >> > in the current use case, just something to be aware of as you move >> >> > between tools. >> >> > >> >> > Martin >> >> > >> >> >> Best, >> >> >> >> >> >> Jim >> >> >> >> >> >> >> >> >> >> >> >> Martin Morgan wrote: >> >> >>> "David A.G" <[email protected]> writes: >> >> >>> >> >> >>>> Dear list, >> >> >>>> >> >> >>>> I have some experience with Bioconductor but am newbie to this list >> >> >>>> and to NGS. I am trying to remove some adapters from my solexa >> >> >>>> s_N_sequence.txt file using Biostrings and ShortRead packages and > the >> >> >>>> vignettes. I managed to read in the text file and got to save the >> >> >>>> reads as follows >> >> >>>> >> >> >>>> fqpattern <- "s_4_sequence.txt" >> >> >>>> f4 <- file.path(analysisPath(sp), fqpattern) >> >> >>>> fq4 <- readFastq(sp, fqpattern) >> >> >>>> reads <- sread(fq4) #"reads" contains more than 4 million 34-length >> >> >>>> fragments >> >> >>>> >> >> >>>> Having the following adapter sequence: >> >> >>>> >> >> >>>> adapter <- DNAString("ACGGATTGTTCAGT") >> >> >>>> >> >> >>>> I tried to mimic the example in the Biostring vignette as follows: >> >> >>>> >> >> >>>> >> >> >>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = > "overlap") >> >> >>>> >> >> >>>> but after more than two hours the process is still running. >> >> >>>> >> >> >>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with >> >> >>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread > on >> >> >>>> adapter removal but does not clear things much to me, since as far > as >> >> >>>> I understood the option mentioned in the thread is not appropriate >> >> >>>> (quote :(though apparently this is not entirely satisfactory, see > the >> >> >>>> second entry!)). >> >> >>>> >> >> >>>> Is this just a memory issue or am I doing something wrong? Shall I >> >> >>>> leave the process to run for longer? >> >> >>> As a bit of a follow-up, a reasonable strategy is to figure out how >> >> >>> long / how much memory the calculation takes on smaller chunks of >> >> >>> data. >> >> >>> Also, here I 'clean' (remove reads with 'N's) and then calculate the >> >> >>> srdistance to your adapter on a lane with 5978132 reads in about a >> >> >>> minute >> >> >>> >> >> >>>> fq <- readFastq(sp, "s_1_sequence.txt") >> >> >>>> fq >> >> >>> class: ShortReadQ >> >> >>> length: 5978132 reads; width: 53 cycles >> >> >>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT")) >> >> >>> user system elapsed 60.960 0.000 60.961 I might then >> >> >>> tabulate the distances >> >> >>> >> >> >>>> table(dist[[1]]) >> >> >>> and subset fq based on some criterion >> >> >>> fq[dist[[1]] > 4] >> >> >>> There's a second interface to this, too, using filters, e.g., >> >> >>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)] >> >> >>> Martin >> >> >>> >> >> >>>> TIA for your help, >> >> >>>> >> >> >>>> Dave >> >> >>>> >> >> >>>> _________________________________________________________________ >> >> >>>> Show them the way! Add maps and directions to your party invites. >> >> >>>> >> >> >>>> [[alternative HTML version deleted]] >> >> >>>> >> >> >>>> _______________________________________________ >> >> >>>> Bioc-sig-sequencing mailing list >> >> >>>> [email protected] >> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> >>> >> >> >> >> >> >> -- >> >> >> James W. MacDonald, M.S. >> >> >> Biostatistician >> >> >> Hildebrandt Lab >> >> >> 8220D MSRB III >> >> >> 1150 W. Medical Center Drive >> >> >> Ann Arbor MI 48109-5646 >> >> >> 734-936-8662 >> >> > >> >> > -- >> >> > Martin Morgan >> >> > Computational Biology / Fred Hutchinson Cancer Research Center >> >> > 1100 Fairview Ave. N. >> >> > PO Box 19024 Seattle, WA 98109 >> >> > >> >> > Location: Arnold Building M2 B169 >> >> > Phone: (206) 667-2793 >> >> > >> >> > _______________________________________________ >> >> > Bioc-sig-sequencing mailing list >> >> > [email protected] >> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> > >> >> > >> >> >> >> _______________________________________________ >> >> Bioc-sig-sequencing mailing list >> >> [email protected] >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> > >> > _________________________________________________________________ >> > Show them the way! Add maps and directions to your party invites. >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioc-sig-sequencing mailing list >> > [email protected] >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> -- >> Martin Morgan >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M2 B169 >> Phone: (206) 667-2793 > > ------------------------------------------------------------------------------ > > Get news, entertainment and everything you care about at Live.com. [[Check it > out!]] -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
