David,
Following up on Martin's comments, I am putting the finishing touches on a function called trimLRPatterns for the Biostrings package. Its purpose is to trim left and/or right flanking patterns from sequences, so it can strip 5' and/or 3' adapters from your reads. The signature for this function is

trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject, max.Lnedit=0, max.Rnedit=0, with.Lindels=FALSE, with.Rindels=FALSE, Lfixed=TRUE, Rfixed=TRUE,
                rangesOnly = FALSE)

I will be checking this function into the BioC 2.4 code line, which requires using R-devel, sometime today or tomorrow. I will send out an e-mail to this group when I check it in and show a simple example of its usage. I talked with Martin and he will wrap this functionality in the ShortRead layer so you don't have to leave the ShortRead class system when removing adapters from your reads.


Cheers,
Patrick




Martin Morgan wrote:
"David A.G" <[email protected]> writes:

hi Martin, thanks for your quick answer.
I can't get round to the fact that using srdistance() will remove the whole
read, instead of just editing the adapter off and keeping the remaining
fragment of the read, which it is useful information. That is the difference I
see between using srdistance() and the approach using pairwiseAlignment(). The
second, as it is shown in the Biostrings vignette using DNAStringSets, will
only edit the adapter sequence from the read. I may be wrong but the problem I
find is that I cannot work back from a DNAStringSet to a ShortReadQ object
that will include quality calls for the remaining bases of the edited reads,
and thus ending up with a full clean ShortReadQ object that can be transformed
back to fastq format.

Yes you're right the context I'm thinking of is that one wants to
remove reads ('filter') that have adapters, rather than removing the
adapter sequence ('edit'?). My thinking is that the reads are short
and numerous anyway, so little information is lost even when 10's of
thousands of reads are filtered, and the filtered reads are more
suspect than normal anyway because of the presence of the
adapter. This might be relatively naive.

I think Patrick's plan is to handle editing more comprehensively, but
there's still lots of flexibility. Here's a (limited testing!)
function that creates a ShortReadQ object by editing out those reads
with srdistance between 'x' and 'string' greater than 'threshold', for
a window onto x from 'start' to 'end'.

trim <-
    function(x, string, threshold, start, end)
{
    ShortReadQ <- function(id, sread, quality, ...)
        new("ShortReadQ", quality=quality, sread=sread, id=id, ...)
    sreads <- sread(x)
    d <- srdistance(DNAStringSet(sreads, start, end), string)[[1]]
    drop <- d <= threshold
    starts <- rep(1, length(x))
    starts[drop] <- end + 1
    ends <- width(sreads)
    ShortReadQ(id(x),
               DNAStringSet(sreads, starts, ends),
               SFastqQuality(BStringSet(quality(quality(x)), starts, ends)))
}

example(readFastq)
[snip]
fq <- trim(rfq, "GAG", 0, 1, 3)
fq
class: ShortReadQ
length: 256 reads; width: 36 33 cycles
sread(fq)[width(sread(fq)) == 33] # trimmed
  A DNAStringSet instance of length 15
     width seq
 [1]    33 AAGTTAATGGATGAATTGGCACAATGCTACAAT
 [2]    33 TTTGTATCTGTTACTGAGAAGTTAATGGATGAA
 [3]    33 GCTTGCGTTTATGGTACGCTGGTCTTTGTATGT
 [4]    33 GATAAATTATGTCTAATATTCAAACTTGCGCCG
 [5]    33 AAATAAAAGTCTGAAACATGATTAAACTCCTAA
 [6]    33 ATTATTTGTCTCCAGCCACTTAAGTGAGGTGAT
 [7]    33 CAGAAGCAATACCGCCAGCAATAGCACCAAACA
 [8]    33 TTTATTGCTGCCGTCATTGCTTATTATGTTCAT
 [9]    33 CTTCTCGAGCTGCGCAAGGATAGGTCGGATTTT
[10]    33 TTGTTCCATTCTTTAGCTCCTAGACCTTTATCA
[11]    33 CGTATGCCGCATGACCTTTCCCATCTTGGCTTC
[12]    33 TATCCTTTCCTTTATCAGCGGCAGACTTGCCCC
[13]    33 GAAGCATCAGCACCAGCACGCTCCCAAGCATTA
[14]    33 TGGTCGGCAGATTGCGCTAAACGGTCACATTAA
[15]    33 TTGTTCCATTCTTTAGCTCCTAGACCTTTAGCA
sread(rfq)[width(sread(fq)) == 33] # original
  A DNAStringSet instance of length 15
     width seq
 [1]    36 GAGAAGTTAATGGATGAATTGGCACAATGCTACAAT
 [2]    36 GAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAA
 [3]    36 GAGGCTTGCGTTTATGGTACGCTGGTCTTTGTATGT
 [4]    36 GAGGATAAATTATGTCTAATATTCAAACTTGCGCCG
 [5]    36 GAGAAATAAAAGTCTGAAACATGATTAAACTCCTAA
 [6]    36 GAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGAT
 [7]    36 GAGCAGAAGCAATACCGCCAGCAATAGCACCAAACA
 [8]    36 GAGTTTATTGCTGCCGTCATTGCTTATTATGTTCAT
 [9]    36 GAGCTTCTCGAGCTGCGCAAGGATAGGTCGGATTTT
[10]    36 GAGTTGTTCCATTCTTTAGCTCCTAGACCTTTATCA
[11]    36 GAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTC
[12]    36 GAGTATCCTTTCCTTTATCAGCGGCAGACTTGCCCC
[13]    36 GAGGAAGCATCAGCACCAGCACGCTCCCAAGCATTA
[14]    36 GAGTGGTCGGCAGATTGCGCTAAACGGTCACATTAA
[15]    36 GAGTTGTTCCATTCTTTAGCTCCTAGACCTTTAGCA

Regarding appending two ShortReadQ objects, the idea came up early in this
thread when I run out of computer memory and I was suggested to split the
original data so that I could do the adapter search in small pieces. I would
then have to put them back together. Also, another situation where needed is
merging two flowcell lanes (or more) coming from the same sample when you are
analyzing large genomes and do not have enough coverage on one single lane. I
guess this could be done outside R using cat.

Both of these sound pretty reasonable; others on the list might have
comments about the pros and cons of appending flow cell lanes. Thanks
for clarifying.

One typo fixed below...

Martin

Dave
To: [email protected]
CC: [email protected]; [email protected]
Subject: Re: [Bioc-sig-seq] adapter removal
From: [email protected]
Date: Wed, 14 Jan 2009 06:03:26 -0800

Hi Dave --

"David A.G" <[email protected]> writes:

Hi Martin and list,
going back to the adapter topic, I tried the srdistance() function and it
works really quick!
yeah it's great; Herve Pages & Patrick Aboyoun deserve credit for this.

I have a question regarding the edit distance. After having used tables()
I
find that the second top element is CTGGACTTTGTAGGATACCCTCGCTTTCCTGCTC
with
890 occurrences. The edit distance for this sequence is of course 0. But
if I
try only part of it as an adapter, CTGGACTTTGTAGGATA, the lowest edit
distance
is 17 and there are 14758 reads with this sequence.
Would you just use 17 as the distance cutoff?
If I wanted to find just those 890 occurrences in an object 'fq' with
the adapter at the start, I'd use srdistance on a DNAStringSet created
from just those nucleotides, along the lines of

d <- srdistance(DNAStringSet(sread(fq), 1, 17), "CTGGACTTTGTAGGATA")
fg[d != 0]

should be fg[d[[1]] != 0]

It's not really clear that the adapters will always be exactly at the
beginning, etc., so this might just be a starting point. Creating a
new DNAStringSet is not particularly memory- or CPU-intensive, so
creating it 'on the fly' as in the above example should be
reasonable. I think Patrick is working on a one-stop-shopping adapter
removal function in Biostrings.

Also,is it possible to merge two ShortReadQ objects, or, having subset an
ShortReadQ object in two by indexing, how do I put them back together?
The ability to 'append' two ShortReadQ objects was added just
yesterday

append(fq, fq)
It should be available in the 'devel' branch using biocLite after
about 1pm Seattle time, all being well; look for ShortRead version >=
1.1.33. It's possible to do this with the released version of
ShortRead, and I can provide off-line hints if you'd like, but I'd
recommend working with the development version of R and these
packages.

I'm curious to know the use case here; the need to append hasn't come
up in my own work.

Martin

Cheers,
Dave
To: [email protected]
CC: [email protected]; [email protected]
Subject: Re: [Bioc-sig-seq] adapter removal
From: [email protected]
Date: Fri, 9 Jan 2009 19:19:18 -0800

"David A.G" <[email protected]> writes:

Jim, thanks for pointing that out, I hadn´t faced that issue yet. The
"reads" object I created only had the sequences and not the qualities, and
after editing the adapters it is interesting to only have the quality
calls
for the remaining bases.
Tyler and Martin, thanks for the information on writing back to fastq.
There is a now an initial version of writeFastq in the 'devel' version
of ShortRead. It is memory and time efficient; try

example(writeFastq)
?writeFastq
please let me know if there are problems, or other missing parts of
ShortRead.

Martin

Dave

Date: Thu, 8 Jan 2009 15:12:01 -0800
From: [email protected]
To: [email protected]
Subject: Re: [Bioc-sig-seq] adapter removal

Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very
slow...
exportFastaQ <- function(shortReadQObject, fileName) {
## dump ShortReadQ object data to FASTQ format
entries <- length(shortReadQObject)
i <- 1
while (i <= entries){
cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
cat(as.character(sread(shortReadQObject)[i]), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
cat(as.character(quality(shortReadQObject)[[i]]), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
i <- i + 1
}
}

Sincerely,
Tyler William H Backman
Bioinformatics Analyst
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
E-mail: [email protected]
1007 Noel T. Keen Hall
University of California
Riverside, CA 92521

Hi Jim --

"James W. MacDonald" <[email protected]> writes:

As a follow-up question, once you have removed the adapter
sequences
is it then possible to write the ShortReadQ object back out in a
fastq
format for aligning with e.g., bowtie? I know I can use
write.XstringSet() to output in fasta format, but I would like to
feed
bowtie the quality scores as well.
Actually a 'writeFastq' is on my 'do this week' list. In the mean
time
my best (untested) suggestion is along the lines of

n <- length(fq)
lines <- character(n)
lines[(seq_len(n)) * 4 - 3] <-
paste("@", as.character(id(fq)), sep="")
lines[(seq_len(n)) * 4 - 2] <-
as.character(sread(rfq1))
lines[(seq_len(n)) * 4 - 1] <-
paste("+", as.character(id(fq)), sep="")
lines[(seq_len(n)) * 4] <-
as.character(quality(quality(fq)))
writeLines(lines, "my.fastq")

This will be unnecessarily memory intensive.

Or is the speed reasonably comparable to use the PDict functions in
Biostrings to do the aligning?
Bowtie is very fast. One potential 'gotcha' is the orientation of
reads throughout the workflow -- Eland _sequence.txt files are the
reads as they come from the machine; MAQ (and I think Bowtie)
aligned
files report reads that align to the '-' strand in their reverse
complement sequence, with qualities reversed to match the reverse
compelement read, and with the alignment position recorded at the
'left-most' (rather than 5', say) end. Not that it will be a problem
in the current use case, just something to be aware of as you move
between tools.

Martin

Best,

Jim



Martin Morgan wrote:
"David A.G" <[email protected]> writes:

Dear list,

I have some experience with Bioconductor but am newbie to this
list
and to NGS. I am trying to remove some adapters from my solexa
s_N_sequence.txt file using Biostrings and ShortRead packages and
the
vignettes. I managed to read in the text file and got to save the
reads as follows

fqpattern <- "s_4_sequence.txt"
f4 <- file.path(analysisPath(sp), fqpattern)
fq4 <- readFastq(sp, fqpattern)
reads <- sread(fq4) #"reads" contains more than 4 million
34-length
fragments

Having the following adapter sequence:

adapter <- DNAString("ACGGATTGTTCAGT")

I tried to mimic the example in the Biostring vignette as
follows:
myAdapterAligns <- pairwiseAlignment(reads, adapter, type =
"overlap")
but after more than two hours the process is still running.

I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24)
with
4Gb RAM, and I only have some 30Mb free RAM left. I found a
thread
on
adapter removal but does not clear things much to me, since as
far
as
I understood the option mentioned in the thread is not
appropriate
(quote :(though apparently this is not entirely satisfactory, see
the
second entry!)).

Is this just a memory issue or am I doing something wrong? Shall
I
leave the process to run for longer?
As a bit of a follow-up, a reasonable strategy is to figure out
how
long / how much memory the calculation takes on smaller chunks of
data.
Also, here I 'clean' (remove reads with 'N's) and then calculate
the
srdistance to your adapter on a lane with 5978132 reads in about a
minute

fq <- readFastq(sp, "s_1_sequence.txt")
fq
class: ShortReadQ
length: 5978132 reads; width: 53 cycles
system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
user system elapsed 60.960 0.000 60.961 I might then
tabulate the distances

table(dist[[1]])
and subset fq based on some criterion
fq[dist[[1]] > 4]
There's a second interface to this, too, using filters, e.g.,
fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
Martin

TIA for your help,

Dave

_________________________________________________________________
Show them the way! Add maps and directions to your party invites.

[[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_________________________________________________________________
Show them the way! Add maps and directions to your party invites.

[[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
------------------------------------------------------------------------------
Get news, entertainment and everything you care about at Live.com. [[Check
it
out!]]
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
------------------------------------------------------------------------------

check out the rest of the Windows Live™. More than mail–Windows Live™ goes way
beyond your inbox. More than messages



_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to