Re: [Bioc-sig-seq] adapter removal

David A.G Wed, 14 Jan 2009 04:48:11 -0800

Hi Martin and list,

going back to the adapter topic, I tried the srdistance() function and it works 
really quick!
I have a question regarding the edit distance. After having used tables() I 
find that the second top element is CTGGACTTTGTAGGATACCCTCGCTTTCCTGCTC with 890 
occurrences. The edit distance for this sequence is of course 0. But if I try 
only part of it as an adapter, CTGGACTTTGTAGGATA, the lowest edit distance is 
17 and there are 14758 reads with this sequence.


Would you just use 17 as the distance cutoff? 

Also,is it possible to merge two ShortReadQ objects, or, having subset an 
ShortReadQ object in two by indexing, how do I put them back together?

Cheers,

Dave

> To: [email protected]
> CC: [email protected]; [email protected]
> Subject: Re: [Bioc-sig-seq] adapter removal
> From: [email protected]
> Date: Fri, 9 Jan 2009 19:19:18 -0800
> 
> "David A.G" <[email protected]> writes:
> 
> > Jim, thanks for pointing that out, I hadn´t faced that issue yet. The 
> > "reads" object I created only had the sequences and not the qualities, and 
> > after editing the adapters it is interesting to only have the quality calls 
> > for the remaining bases. 
> >
> > Tyler and Martin, thanks for the information on writing back to fastq.
> 
> There is a now an initial version of writeFastq in the 'devel' version
> of ShortRead. It is memory and time efficient; try
> 
> > example(writeFastq)
> > ?writeFastq
> 
> please let me know if there are problems, or other missing parts of
> ShortRead.
> 
> Martin
> 
> > Dave
> >
> >> Date: Thu, 8 Jan 2009 15:12:01 -0800
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: Re: [Bioc-sig-seq] adapter removal
> >> 
> >> Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very 
> >> slow...
> >> 
> >> exportFastaQ <- function(shortReadQObject, fileName) {
> >>     ## dump ShortReadQ object data to FASTQ format
> >>     entries <- length(shortReadQObject)
> >>     i <- 1
> >>     while (i <= entries){
> >>         cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
> >> file=fileName, append=TRUE)
> >>         cat("\n", file=fileName, append=TRUE)
> >>         cat(as.character(sread(shortReadQObject)[i]), sep="",
> >> file=fileName, append=TRUE)
> >>         cat("\n", file=fileName, append=TRUE)
> >>         cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
> >> file=fileName, append=TRUE)
> >>         cat("\n", file=fileName, append=TRUE)
> >>         cat(as.character(quality(shortReadQObject)[[i]]), sep="",
> >> file=fileName, append=TRUE)
> >>         cat("\n", file=fileName, append=TRUE)
> >>                 i <- i + 1
> >>         }
> >> }
> >> 
> >> Sincerely,
> >> Tyler William H Backman
> >> Bioinformatics Analyst
> >> Institute for Integrative Genome Biology (IIGB)
> >> Department of Botany and Plant Sciences
> >> E-mail: [email protected]
> >> 1007 Noel T. Keen Hall
> >> University of California
> >> Riverside, CA 92521
> >> 
> >> > Hi Jim --
> >> >
> >> > "James W. MacDonald" <[email protected]> writes:
> >> >
> >> >> As a follow-up question, once you have removed the adapter sequences
> >> >> is it then possible to write the ShortReadQ object back out in a fastq
> >> >> format for aligning with e.g., bowtie? I know I can use
> >> >> write.XstringSet() to output in fasta format, but I would like to feed
> >> >> bowtie the quality scores as well.
> >> >
> >> > Actually a 'writeFastq' is on my 'do this week' list. In the mean time
> >> > my best (untested) suggestion is along the lines of
> >> >
> >> > n <- length(fq)
> >> > lines <- character(n)
> >> > lines[(seq_len(n)) * 4 - 3] <-
> >> >    paste("@", as.character(id(fq)), sep="")
> >> > lines[(seq_len(n)) * 4 - 2] <-
> >> >    as.character(sread(rfq1))
> >> > lines[(seq_len(n)) * 4 - 1] <-
> >> >     paste("+", as.character(id(fq)), sep="")
> >> > lines[(seq_len(n)) * 4] <-
> >> >     as.character(quality(quality(fq)))
> >> > writeLines(lines, "my.fastq")
> >> >
> >> > This will be unnecessarily memory intensive.
> >> >
> >> >> Or is the speed reasonably comparable to use the PDict functions in
> >> >> Biostrings to do the aligning?
> >> >
> >> > Bowtie is very fast. One potential 'gotcha' is the orientation of
> >> > reads throughout the workflow -- Eland _sequence.txt files are the
> >> > reads as they come from the machine; MAQ (and I think Bowtie) aligned
> >> > files report reads that align to the '-' strand in their reverse
> >> > complement sequence, with qualities reversed to match the reverse
> >> > compelement read, and with the alignment position recorded at the
> >> > 'left-most' (rather than 5', say) end. Not that it will be a problem
> >> > in the current use case, just something to be aware of as you move
> >> > between tools.
> >> >
> >> > Martin
> >> >
> >> >> Best,
> >> >>
> >> >> Jim
> >> >>
> >> >>
> >> >>
> >> >> Martin Morgan wrote:
> >> >>> "David A.G" <[email protected]> writes:
> >> >>>
> >> >>>> Dear list,
> >> >>>>
> >> >>>> I have some experience with Bioconductor but am newbie to this list
> >> >>>> and to NGS. I am trying to remove some adapters from my solexa
> >> >>>> s_N_sequence.txt file using Biostrings and ShortRead packages and the
> >> >>>> vignettes.  I managed to read in the text file and got to save the
> >> >>>> reads as follows
> >> >>>>
> >> >>>> fqpattern <- "s_4_sequence.txt"
> >> >>>> f4 <- file.path(analysisPath(sp), fqpattern)
> >> >>>> fq4 <- readFastq(sp, fqpattern)
> >> >>>> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length
> >> >>>> fragments
> >> >>>>
> >> >>>> Having the following adapter sequence:
> >> >>>>
> >> >>>> adapter <- DNAString("ACGGATTGTTCAGT")
> >> >>>>
> >> >>>> I tried to mimic the example in the Biostring vignette as follows:
> >> >>>>
> >> >>>>
> >> >>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
> >> >>>>
> >> >>>> but after more than two hours the process is still running.
> >> >>>>
> >> >>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with
> >> >>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on
> >> >>>> adapter removal but does not clear things much to me, since as far as
> >> >>>> I understood the option mentioned in the thread is not appropriate
> >> >>>> (quote :(though apparently this is not entirely satisfactory, see the
> >> >>>> second entry!)).
> >> >>>>
> >> >>>> Is this just a memory issue or am I doing something wrong? Shall I
> >> >>>> leave the process to run for longer?
> >> >>> As a bit of a follow-up, a reasonable strategy is to figure out how
> >> >>> long / how much memory the calculation takes on smaller chunks of
> >> >>> data.
> >> >>> Also, here I 'clean' (remove reads with 'N's) and then calculate the
> >> >>> srdistance to your adapter on a lane with 5978132 reads in about a
> >> >>> minute
> >> >>>
> >> >>>> fq <- readFastq(sp, "s_1_sequence.txt")
> >> >>>> fq
> >> >>> class: ShortReadQ
> >> >>> length: 5978132 reads; width: 53 cycles
> >> >>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
> >> >>>    user  system elapsed  60.960   0.000  60.961 I might then
> >> >>> tabulate the distances
> >> >>>
> >> >>>> table(dist[[1]])
> >> >>> and subset fq based on some criterion
> >> >>> fq[dist[[1]] > 4]
> >> >>> There's a second interface to this, too, using filters, e.g.,
> >> >>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
> >> >>> Martin
> >> >>>
> >> >>>> TIA for your help,
> >> >>>>
> >> >>>> Dave
> >> >>>>
> >> >>>> _________________________________________________________________
> >> >>>> Show them the way! Add maps and directions to your party invites.
> >> >>>>
> >> >>>>       [[alternative HTML version deleted]]
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> Bioc-sig-sequencing mailing list
> >> >>>> [email protected]
> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >> >>>
> >> >>
> >> >> --
> >> >> James W. MacDonald, M.S.
> >> >> Biostatistician
> >> >> Hildebrandt Lab
> >> >> 8220D MSRB III
> >> >> 1150 W. Medical Center Drive
> >> >> Ann Arbor MI 48109-5646
> >> >> 734-936-8662
> >> >
> >> > --
> >> > Martin Morgan
> >> > Computational Biology / Fred Hutchinson Cancer Research Center
> >> > 1100 Fairview Ave. N.
> >> > PO Box 19024 Seattle, WA 98109
> >> >
> >> > Location: Arnold Building M2 B169
> >> > Phone: (206) 667-2793
> >> >
> >> > _______________________________________________
> >> > Bioc-sig-sequencing mailing list
> >> > [email protected]
> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >> >
> >> >
> >> 
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> [email protected]
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >
> > _________________________________________________________________
> > Show them the way! Add maps and directions to your party invites. 
> >
> >     [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > [email protected]
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793

_________________________________________________________________


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] adapter removal

Reply via email to