Re: [Bioc-sig-seq] adapter removal

Patrick Aboyoun Wed, 14 Jan 2009 09:37:30 -0800

David,

Following up on Martin's comments, I am putting the finishing touches ona function called trimLRPatterns for the Biostrings package. Its purposeis to trim left and/or right flanking patterns from sequences, so it canstrip 5' and/or 3' adapters from your reads. The signature for thisfunction is

trimLRPatterns(Lpattern=NULL, Rpattern=NULL, subject, max.Lnedit=0,max.Rnedit=0,with.Lindels=FALSE, with.Rindels=FALSE, Lfixed=TRUE,Rfixed=TRUE,

                rangesOnly = FALSE)

I will be checking this function into the BioC 2.4 code line, whichrequires using R-devel, sometime today or tomorrow. I will send out ane-mail to this group when I check it in and show a simple example of itsusage. I talked with Martin and he will wrap this functionality in theShortRead layer so you don't have to leave the ShortRead class systemwhen removing adapters from your reads.



Cheers,
Patrick




Martin Morgan wrote:

"David A.G" <[email protected]> writes:

hi Martin, thanks for your quick answer.
I can't get round to the fact that using srdistance() will remove the whole
read, instead of just editing the adapter off and keeping the remaining
fragment of the read, which it is useful information. That is the difference I
see between using srdistance() and the approach using pairwiseAlignment(). The
second, as it is shown in the Biostrings vignette using DNAStringSets, will
only edit the adapter sequence from the read. I may be wrong but the problem I
find is that I cannot work back from a DNAStringSet to a ShortReadQ object
that will include quality calls for the remaining bases of the edited reads,
and thus ending up with a full clean ShortReadQ object that can be transformed
back to fastq format.


Yes you're right the context I'm thinking of is that one wants to
remove reads ('filter') that have adapters, rather than removing the
adapter sequence ('edit'?). My thinking is that the reads are short
and numerous anyway, so little information is lost even when 10's of
thousands of reads are filtered, and the filtered reads are more
suspect than normal anyway because of the presence of the
adapter. This might be relatively naive.

I think Patrick's plan is to handle editing more comprehensively, but
there's still lots of flexibility. Here's a (limited testing!)
function that creates a ShortReadQ object by editing out those reads
with srdistance between 'x' and 'string' greater than 'threshold', for
a window onto x from 'start' to 'end'.

trim <-
    function(x, string, threshold, start, end)
{
    ShortReadQ <- function(id, sread, quality, ...)
        new("ShortReadQ", quality=quality, sread=sread, id=id, ...)
    sreads <- sread(x)
    d <- srdistance(DNAStringSet(sreads, start, end), string)[[1]]
    drop <- d <= threshold
    starts <- rep(1, length(x))
    starts[drop] <- end + 1
    ends <- width(sreads)
    ShortReadQ(id(x),
               DNAStringSet(sreads, starts, ends),
               SFastqQuality(BStringSet(quality(quality(x)), starts, ends)))
}

example(readFastq)

[snip]

fq <- trim(rfq, "GAG", 0, 1, 3)
fq

class: ShortReadQ
length: 256 reads; width: 36 33 cycles

sread(fq)[width(sread(fq)) == 33] # trimmed

  A DNAStringSet instance of length 15
     width seq
 [1]    33 AAGTTAATGGATGAATTGGCACAATGCTACAAT
 [2]    33 TTTGTATCTGTTACTGAGAAGTTAATGGATGAA
 [3]    33 GCTTGCGTTTATGGTACGCTGGTCTTTGTATGT
 [4]    33 GATAAATTATGTCTAATATTCAAACTTGCGCCG
 [5]    33 AAATAAAAGTCTGAAACATGATTAAACTCCTAA
 [6]    33 ATTATTTGTCTCCAGCCACTTAAGTGAGGTGAT
 [7]    33 CAGAAGCAATACCGCCAGCAATAGCACCAAACA
 [8]    33 TTTATTGCTGCCGTCATTGCTTATTATGTTCAT
 [9]    33 CTTCTCGAGCTGCGCAAGGATAGGTCGGATTTT
[10]    33 TTGTTCCATTCTTTAGCTCCTAGACCTTTATCA
[11]    33 CGTATGCCGCATGACCTTTCCCATCTTGGCTTC
[12]    33 TATCCTTTCCTTTATCAGCGGCAGACTTGCCCC
[13]    33 GAAGCATCAGCACCAGCACGCTCCCAAGCATTA
[14]    33 TGGTCGGCAGATTGCGCTAAACGGTCACATTAA
[15]    33 TTGTTCCATTCTTTAGCTCCTAGACCTTTAGCA

sread(rfq)[width(sread(fq)) == 33] # original

  A DNAStringSet instance of length 15
     width seq
 [1]    36 GAGAAGTTAATGGATGAATTGGCACAATGCTACAAT
 [2]    36 GAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAA
 [3]    36 GAGGCTTGCGTTTATGGTACGCTGGTCTTTGTATGT
 [4]    36 GAGGATAAATTATGTCTAATATTCAAACTTGCGCCG
 [5]    36 GAGAAATAAAAGTCTGAAACATGATTAAACTCCTAA
 [6]    36 GAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGAT
 [7]    36 GAGCAGAAGCAATACCGCCAGCAATAGCACCAAACA
 [8]    36 GAGTTTATTGCTGCCGTCATTGCTTATTATGTTCAT
 [9]    36 GAGCTTCTCGAGCTGCGCAAGGATAGGTCGGATTTT
[10]    36 GAGTTGTTCCATTCTTTAGCTCCTAGACCTTTATCA
[11]    36 GAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTC
[12]    36 GAGTATCCTTTCCTTTATCAGCGGCAGACTTGCCCC
[13]    36 GAGGAAGCATCAGCACCAGCACGCTCCCAAGCATTA
[14]    36 GAGTGGTCGGCAGATTGCGCTAAACGGTCACATTAA
[15]    36 GAGTTGTTCCATTCTTTAGCTCCTAGACCTTTAGCA

Regarding appending two ShortReadQ objects, the idea came up early in this
thread when I run out of computer memory and I was suggested to split the
original data so that I could do the adapter search in small pieces. I would
then have to put them back together. Also, another situation where needed is
merging two flowcell lanes (or more) coming from the same sample when you are
analyzing large genomes and do not have enough coverage on one single lane. I
guess this could be done outside R using cat.


Both of these sound pretty reasonable; others on the list might have
comments about the pros and cons of appending flow cell lanes. Thanks
for clarifying.

One typo fixed below...

Martin

Dave

To: [email protected]
CC: [email protected]; [email protected]
Subject: Re: [Bioc-sig-seq] adapter removal
From: [email protected]
Date: Wed, 14 Jan 2009 06:03:26 -0800

Hi Dave --

"David A.G" <[email protected]> writes:

Hi Martin and list,
going back to the adapter topic, I tried the srdistance() function and it
works really quick!

yeah it's great; Herve Pages & Patrick Aboyoun deserve credit for this.

I have a question regarding the edit distance. After having used tables()

find that the second top element is CTGGACTTTGTAGGATACCCTCGCTTTCCTGCTC

with

890 occurrences. The edit distance for this sequence is of course 0. But

if I

try only part of it as an adapter, CTGGACTTTGTAGGATA, the lowest edit

distance

is 17 and there are 14758 reads with this sequence.
Would you just use 17 as the distance cutoff?

If I wanted to find just those 890 occurrences in an object 'fq' with
the adapter at the start, I'd use srdistance on a DNAStringSet created
from just those nucleotides, along the lines of

d <- srdistance(DNAStringSet(sread(fq), 1, 17), "CTGGACTTTGTAGGATA")
fg[d != 0]


should be fg[d[[1]] != 0]

It's not really clear that the adapters will always be exactly at the
beginning, etc., so this might just be a starting point. Creating a
new DNAStringSet is not particularly memory- or CPU-intensive, so
creating it 'on the fly' as in the above example should be
reasonable. I think Patrick is working on a one-stop-shopping adapter
removal function in Biostrings.

Also,is it possible to merge two ShortReadQ objects, or, having subset an
ShortReadQ object in two by indexing, how do I put them back together?

The ability to 'append' two ShortReadQ objects was added just
yesterday

append(fq, fq)

It should be available in the 'devel' branch using biocLite after
about 1pm Seattle time, all being well; look for ShortRead version >=
1.1.33. It's possible to do this with the released version of
ShortRead, and I can provide off-line hints if you'd like, but I'd
recommend working with the development version of R and these
packages.

I'm curious to know the use case here; the need to append hasn't come
up in my own work.

Martin

Cheers,
Dave

To: [email protected]
CC: [email protected]; [email protected]
Subject: Re: [Bioc-sig-seq] adapter removal
From: [email protected]
Date: Fri, 9 Jan 2009 19:19:18 -0800

"David A.G" <[email protected]> writes:

Jim, thanks for pointing that out, I hadn´t faced that issue yet. The

"reads" object I created only had the sequences and not the qualities, and
after editing the adapters it is interesting to only have the quality

calls

for the remaining bases.

Tyler and Martin, thanks for the information on writing back to fastq.

There is a now an initial version of writeFastq in the 'devel' version
of ShortRead. It is memory and time efficient; try

example(writeFastq)
?writeFastq

please let me know if there are problems, or other missing parts of
ShortRead.

Martin

Dave

Date: Thu, 8 Jan 2009 15:12:01 -0800
From: [email protected]
To: [email protected]
Subject: Re: [Bioc-sig-seq] adapter removal

Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very

slow...

exportFastaQ <- function(shortReadQObject, fileName) {
## dump ShortReadQ object data to FASTQ format
entries <- length(shortReadQObject)
i <- 1
while (i <= entries){
cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
cat(as.character(sread(shortReadQObject)[i]), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
cat(as.character(quality(shortReadQObject)[[i]]), sep="",
file=fileName, append=TRUE)
cat("\n", file=fileName, append=TRUE)
i <- i + 1
}
}

Sincerely,
Tyler William H Backman
Bioinformatics Analyst
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
E-mail: [email protected]
1007 Noel T. Keen Hall
University of California
Riverside, CA 92521

Hi Jim --

"James W. MacDonald" <[email protected]> writes:

As a follow-up question, once you have removed the adapter

sequences

is it then possible to write the ShortReadQ object back out in a

fastq

format for aligning with e.g., bowtie? I know I can use
write.XstringSet() to output in fasta format, but I would like to

feed

bowtie the quality scores as well.

Actually a 'writeFastq' is on my 'do this week' list. In the mean

time

my best (untested) suggestion is along the lines of

n <- length(fq)
lines <- character(n)
lines[(seq_len(n)) * 4 - 3] <-
paste("@", as.character(id(fq)), sep="")
lines[(seq_len(n)) * 4 - 2] <-
as.character(sread(rfq1))
lines[(seq_len(n)) * 4 - 1] <-
paste("+", as.character(id(fq)), sep="")
lines[(seq_len(n)) * 4] <-
as.character(quality(quality(fq)))
writeLines(lines, "my.fastq")

This will be unnecessarily memory intensive.

Or is the speed reasonably comparable to use the PDict functions in
Biostrings to do the aligning?

Bowtie is very fast. One potential 'gotcha' is the orientation of
reads throughout the workflow -- Eland _sequence.txt files are the
reads as they come from the machine; MAQ (and I think Bowtie)

aligned

files report reads that align to the '-' strand in their reverse
complement sequence, with qualities reversed to match the reverse
compelement read, and with the alignment position recorded at the
'left-most' (rather than 5', say) end. Not that it will be a problem
in the current use case, just something to be aware of as you move
between tools.

Martin

Best,

Jim



Martin Morgan wrote:

"David A.G" <[email protected]> writes:

Dear list,

I have some experience with Bioconductor but am newbie to this

list

and to NGS. I am trying to remove some adapters from my solexa
s_N_sequence.txt file using Biostrings and ShortRead packages and

the

vignettes. I managed to read in the text file and got to save the
reads as follows

fqpattern <- "s_4_sequence.txt"
f4 <- file.path(analysisPath(sp), fqpattern)
fq4 <- readFastq(sp, fqpattern)
reads <- sread(fq4) #"reads" contains more than 4 million

34-length

fragments

Having the following adapter sequence:

adapter <- DNAString("ACGGATTGTTCAGT")

I tried to mimic the example in the Biostring vignette as

follows:

myAdapterAligns <- pairwiseAlignment(reads, adapter, type =

"overlap")

but after more than two hours the process is still running.

I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24)

with

4Gb RAM, and I only have some 30Mb free RAM left. I found a

thread

on

adapter removal but does not clear things much to me, since as

far

as

I understood the option mentioned in the thread is not

appropriate

(quote :(though apparently this is not entirely satisfactory, see

the

second entry!)).

Is this just a memory issue or am I doing something wrong? Shall

leave the process to run for longer?

As a bit of a follow-up, a reasonable strategy is to figure out

how

long / how much memory the calculation takes on smaller chunks of
data.
Also, here I 'clean' (remove reads with 'N's) and then calculate

the

srdistance to your adapter on a lane with 5978132 reads in about a
minute

fq <- readFastq(sp, "s_1_sequence.txt")
fq

class: ShortReadQ
length: 5978132 reads; width: 53 cycles

system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))

user system elapsed 60.960 0.000 60.961 I might then
tabulate the distances

table(dist[[1]])

and subset fq based on some criterion
fq[dist[[1]] > 4]
There's a second interface to this, too, using filters, e.g.,
fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
Martin

TIA for your help,

Dave

_________________________________________________________________
Show them the way! Add maps and directions to your party invites.

[[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_________________________________________________________________
Show them the way! Add maps and directions to your party invites.

[[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

------------------------------------------------------------------------------

Get news, entertainment and everything you care about at Live.com. [[Check

it

out!]]

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

------------------------------------------------------------------------------

check out the rest of the Windows Live™. More than mail–Windows Live™ goes way
beyond your inbox. More than messages


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] adapter removal

Reply via email to