Re: [Bioc-sig-seq] adapter removal

James W. MacDonald Thu, 08 Jan 2009 13:28:04 -0800

As a follow-up question, once you have removed the adapter sequences isit then possible to write the ShortReadQ object back out in a fastqformat for aligning with e.g., bowtie? I know I can usewrite.XstringSet() to output in fasta format, but I would like to feedbowtie the quality scores as well.

Or is the speed reasonably comparable to use the PDict functions inBiostrings to do the aligning?


Best,

Jim



Martin Morgan wrote:

"David A.G" <[email protected]> writes:

Dear list,

I have some experience with Bioconductor but am newbie to this list and to NGS. 
I am trying to remove some adapters from my solexa s_N_sequence.txt file using 
Biostrings and ShortRead packages and the vignettes.  I managed to read in the 
text file and got to save the reads as follows

fqpattern <- "s_4_sequence.txt"
f4 <- file.path(analysisPath(sp), fqpattern)
fq4 <- readFastq(sp, fqpattern)
reads <- sread(fq4)  #"reads" contains more than 4 million 34-length fragments

Having the following adapter sequence:

adapter <- DNAString("ACGGATTGTTCAGT")

I tried to mimic the example in the Biostring vignette as follows:


myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")

but after more than two hours the process is still running.

I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with 4Gb RAM, 
and I only have some 30Mb free RAM left. I found a thread on adapter removal 
but does not clear things much to me, since as far as I understood the option 
mentioned in the thread is not appropriate (quote :(though apparently this is 
not entirely satisfactory, see the second entry!)).

Is this just a memory issue or am I doing something wrong? Shall I leave the 
process to run for longer?


As a bit of a follow-up, a reasonable strategy is to figure out how
long / how much memory the calculation takes on smaller chunks of
data.

Also, here I 'clean' (remove reads with 'N's) and then calculate the
srdistance to your adapter on a lane with 5978132 reads in about a
minute

fq <- readFastq(sp, "s_1_sequence.txt")
fq

class: ShortReadQ
length: 5978132 reads; width: 53 cycles

system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))

user system elapsed60.960 0.000 60.961

I might then tabulate the distances

table(dist[[1]])


and subset fq based on some criterion

fq[dist[[1]] > 4]

There's a second interface to this, too, using filters, e.g.,

fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]

Martin

TIA for your help,

Dave

_________________________________________________________________

Show them the way! Add maps and directions to your party invites.

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] adapter removal

Reply via email to