"David A.G" <[email protected]> writes: > Jim, thanks for pointing that out, I hadn“t faced that issue yet. The "reads" > object I created only had the sequences and not the qualities, and after > editing the adapters it is interesting to only have the quality calls for the > remaining bases. > > Tyler and Martin, thanks for the information on writing back to fastq.
There is a now an initial version of writeFastq in the 'devel' version of ShortRead. It is memory and time efficient; try > example(writeFastq) > ?writeFastq please let me know if there are problems, or other missing parts of ShortRead. Martin > Dave > >> Date: Thu, 8 Jan 2009 15:12:01 -0800 >> From: [email protected] >> To: [email protected] >> Subject: Re: [Bioc-sig-seq] adapter removal >> >> Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very slow... >> >> exportFastaQ <- function(shortReadQObject, fileName) { >> ## dump ShortReadQ object data to FASTQ format >> entries <- length(shortReadQObject) >> i <- 1 >> while (i <= entries){ >> cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="", >> file=fileName, append=TRUE) >> cat("\n", file=fileName, append=TRUE) >> cat(as.character(sread(shortReadQObject)[i]), sep="", >> file=fileName, append=TRUE) >> cat("\n", file=fileName, append=TRUE) >> cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="", >> file=fileName, append=TRUE) >> cat("\n", file=fileName, append=TRUE) >> cat(as.character(quality(shortReadQObject)[[i]]), sep="", >> file=fileName, append=TRUE) >> cat("\n", file=fileName, append=TRUE) >> i <- i + 1 >> } >> } >> >> Sincerely, >> Tyler William H Backman >> Bioinformatics Analyst >> Institute for Integrative Genome Biology (IIGB) >> Department of Botany and Plant Sciences >> E-mail: [email protected] >> 1007 Noel T. Keen Hall >> University of California >> Riverside, CA 92521 >> >> > Hi Jim -- >> > >> > "James W. MacDonald" <[email protected]> writes: >> > >> >> As a follow-up question, once you have removed the adapter sequences >> >> is it then possible to write the ShortReadQ object back out in a fastq >> >> format for aligning with e.g., bowtie? I know I can use >> >> write.XstringSet() to output in fasta format, but I would like to feed >> >> bowtie the quality scores as well. >> > >> > Actually a 'writeFastq' is on my 'do this week' list. In the mean time >> > my best (untested) suggestion is along the lines of >> > >> > n <- length(fq) >> > lines <- character(n) >> > lines[(seq_len(n)) * 4 - 3] <- >> > paste("@", as.character(id(fq)), sep="") >> > lines[(seq_len(n)) * 4 - 2] <- >> > as.character(sread(rfq1)) >> > lines[(seq_len(n)) * 4 - 1] <- >> > paste("+", as.character(id(fq)), sep="") >> > lines[(seq_len(n)) * 4] <- >> > as.character(quality(quality(fq))) >> > writeLines(lines, "my.fastq") >> > >> > This will be unnecessarily memory intensive. >> > >> >> Or is the speed reasonably comparable to use the PDict functions in >> >> Biostrings to do the aligning? >> > >> > Bowtie is very fast. One potential 'gotcha' is the orientation of >> > reads throughout the workflow -- Eland _sequence.txt files are the >> > reads as they come from the machine; MAQ (and I think Bowtie) aligned >> > files report reads that align to the '-' strand in their reverse >> > complement sequence, with qualities reversed to match the reverse >> > compelement read, and with the alignment position recorded at the >> > 'left-most' (rather than 5', say) end. Not that it will be a problem >> > in the current use case, just something to be aware of as you move >> > between tools. >> > >> > Martin >> > >> >> Best, >> >> >> >> Jim >> >> >> >> >> >> >> >> Martin Morgan wrote: >> >>> "David A.G" <[email protected]> writes: >> >>> >> >>>> Dear list, >> >>>> >> >>>> I have some experience with Bioconductor but am newbie to this list >> >>>> and to NGS. I am trying to remove some adapters from my solexa >> >>>> s_N_sequence.txt file using Biostrings and ShortRead packages and the >> >>>> vignettes. I managed to read in the text file and got to save the >> >>>> reads as follows >> >>>> >> >>>> fqpattern <- "s_4_sequence.txt" >> >>>> f4 <- file.path(analysisPath(sp), fqpattern) >> >>>> fq4 <- readFastq(sp, fqpattern) >> >>>> reads <- sread(fq4) #"reads" contains more than 4 million 34-length >> >>>> fragments >> >>>> >> >>>> Having the following adapter sequence: >> >>>> >> >>>> adapter <- DNAString("ACGGATTGTTCAGT") >> >>>> >> >>>> I tried to mimic the example in the Biostring vignette as follows: >> >>>> >> >>>> >> >>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap") >> >>>> >> >>>> but after more than two hours the process is still running. >> >>>> >> >>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with >> >>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on >> >>>> adapter removal but does not clear things much to me, since as far as >> >>>> I understood the option mentioned in the thread is not appropriate >> >>>> (quote :(though apparently this is not entirely satisfactory, see the >> >>>> second entry!)). >> >>>> >> >>>> Is this just a memory issue or am I doing something wrong? Shall I >> >>>> leave the process to run for longer? >> >>> As a bit of a follow-up, a reasonable strategy is to figure out how >> >>> long / how much memory the calculation takes on smaller chunks of >> >>> data. >> >>> Also, here I 'clean' (remove reads with 'N's) and then calculate the >> >>> srdistance to your adapter on a lane with 5978132 reads in about a >> >>> minute >> >>> >> >>>> fq <- readFastq(sp, "s_1_sequence.txt") >> >>>> fq >> >>> class: ShortReadQ >> >>> length: 5978132 reads; width: 53 cycles >> >>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT")) >> >>> user system elapsed 60.960 0.000 60.961 I might then >> >>> tabulate the distances >> >>> >> >>>> table(dist[[1]]) >> >>> and subset fq based on some criterion >> >>> fq[dist[[1]] > 4] >> >>> There's a second interface to this, too, using filters, e.g., >> >>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)] >> >>> Martin >> >>> >> >>>> TIA for your help, >> >>>> >> >>>> Dave >> >>>> >> >>>> _________________________________________________________________ >> >>>> Show them the way! Add maps and directions to your party invites. >> >>>> >> >>>> [[alternative HTML version deleted]] >> >>>> >> >>>> _______________________________________________ >> >>>> Bioc-sig-sequencing mailing list >> >>>> [email protected] >> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >>> >> >> >> >> -- >> >> James W. MacDonald, M.S. >> >> Biostatistician >> >> Hildebrandt Lab >> >> 8220D MSRB III >> >> 1150 W. Medical Center Drive >> >> Ann Arbor MI 48109-5646 >> >> 734-936-8662 >> > >> > -- >> > Martin Morgan >> > Computational Biology / Fred Hutchinson Cancer Research Center >> > 1100 Fairview Ave. N. >> > PO Box 19024 Seattle, WA 98109 >> > >> > Location: Arnold Building M2 B169 >> > Phone: (206) 667-2793 >> > >> > _______________________________________________ >> > Bioc-sig-sequencing mailing list >> > [email protected] >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> > >> > >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > _________________________________________________________________ > Show them the way! Add maps and directions to your party invites. > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
