Krys,
Building up on Martin's suggestion, I also recommend you take a look at the new pairwiseAlignment function in the development version of Biostrings. Since it uses an O(N^2) dynamic programming algorithm it isn't necessarily suited for genome wide analysis, but it is perfectly suited for aligning short strings and filtered out short reads by adapter strings. The pairwiseAlignment function can fit global, local, and ends-free (overlap) alignments with or without quality-based substitution penalties and with or without indels.

Below is a sample script run that uses pairwiseAlignment to filter out short reads based on the adapter. This script looks at four different prefix cutoff levels (8, 16, 24, 36) that you could use to filter your data and presents the results you would get when you use the equivalent to a perfect match to the first 8 characters as your cutoff.


Patrick


> library(ShortRead)
> sp <- SolexaPath("/path/to/solexa/expt")
> pat <- "s_1_sequence.txt"
> shortReads <- readFastq(analysisPath(sp), pattern = pat)
> cleanedShortReads <- clean(shortReads) # no reads with uncalled bases
> cleanedShortReads
class: ShortReadQ
length: 3506447 reads; width: 35 cycles
> adapter <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
> system.time({
+   adapterScores <-
+ pairwiseAlignment(pattern = sread(cleanedShortReads), subject = adapter, + patternQuality = quality(quality(cleanedShortReads)),
+                       subjectQuality = 99L,
+                       qualityType = "Solexa",
+                       type = "overlap",
+                       scoreOnly = TRUE)
+ })
  user  system elapsed
92.834   2.040  94.873
> firstNChar <- c(8, 16, 24, 32)
> cutoffScores <- c(0, 0, 0, 0)
> names(cutoffScores) <- firstNChar
> for (i in seq_len(length(cutoffScores))) {
+ print(system.time({
+   cutoffScores[i] <-
+     max(pairwiseAlignment(pattern = sread(cleanedShortReads),
+                           subject = substring(adapter, 1, firstNChar[i]),
+ patternQuality = quality(quality(cleanedShortReads)),
+                           subjectQuality = 99L,
+                           qualityType = "Solexa",
+                           type = "overlap",
+                           scoreOnly = TRUE))
+ }))
+ }
  user  system elapsed
16.901   0.480  17.381
  user  system elapsed
36.886   0.804  37.687
  user  system elapsed
58.692   1.412  60.117
  user  system elapsed
75.805   1.812  77.617
> cutoffScores
      8       16       24       32
15.96356 31.92712 47.89068 63.82726
> quantile(adapterScores, seq(0.9, 1, 0.005))
90% 90.5% 91% 91.5% 92% 92.5% 93% 93.5% 9.909755 11.882821 13.878695 17.341880 21.739067 35.019755 50.056933 55.612625 94% 94.5% 95% 95.5% 96% 96.5% 97% 97.5% 57.737140 58.951218 59.979969 60.780243 61.931897 62.586998 63.861232 65.047516
     98%     98.5%       99%     99.5%      100%
66.623512 68.858673 69.453423 69.661290 69.773109
> t(do.call("rbind", lapply(cutoffScores, function(x) table(adapterScores > x))))
           8      16      24      32
FALSE 3206960 3240999 3257733 3400456
TRUE   299487  265448  248714  105991
> tables(cleanedShortReads[adapterScores >= cutoffScores["8"]], n = 10)[["top"]]
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGA
                             63558                               62264
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTGA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGA
                             17832                               15939
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTATA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTAA
                             11554                                7391
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGTA
                              6452                                6343
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTTTA GATCGGAAGAGCTCGTATGACGTCTTCTGCTTAGA
                              5724                                3878
> sessionInfo()
R version 2.8.0 Under development (unstable) (2008-07-07 r46041)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools stats graphics grDevices utils datasets methods [8] base
other attached packages:
[1] ShortRead_0.1.32 lattice_0.17-10 Biostrings_2.9.42 Biobase_2.1.7
loaded via a namespace (and not attached):
[1] grid_2.8.0




Martin Morgan wrote:
"Krys Kelly" <[EMAIL PROTECTED]> writes:

I have inherited a pipeline for Solexa sequence data using Perl, Bioperl,
SSAHA and mySQL.  As an R/Bioconducter user I am interested in ShortRead and
BiostringsCinterfaceDemo.

However, in the short term I need to use the current pipeline.  The imaging
is done by the Sequencing Facility and we get fastq files with the 3'
adapter still attached. The adapter removal is currently done by a Perl
script which just keeps sequences which match any number of letters in
[ACGT] followed by the first 8 letters of the adapter.  This seems pretty
crude (e.g. only using 8 letters, not allowing for mismatches, not allowing
for the diminishing quality along the length of the read).

In the very latest Biostrings, and thanks to Herve's skills, you can

library(ShortRead)
sp <- SolexaPath("/path/to/solexa/expt")
pat <- "s_1_0001_seq.txt" # or any regexp to read multiple files
seq <- readFastq(analysisPath(sp), patreadXStringColumns(baseCallPath(sp), pat,
+                           list(NULL, NULL, NULL, NULL, "DNAString"))[[1]]
seq <- clean(seq) # no reads with uncalled bases

(or readFastq on analysisPath(sp), "s_1_sequence.txt", for instance,
but these are 'filtered' and it's not really clear that that is the
right place for a careful analysis to start), and then

tables(seq)$top[1:5]
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGAT 186 126 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT 71 51 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAA 16 to remind me what the primer sequence looks like (!) and

pd <- PDict(seq, max.mismatch=3)
subj <- DNAString("GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA")
cnts <- countPDict(pd, subj, max.mismatch=3)

cnts now contains the number of times each read in seq matched the
primer with up to 3 mismatches (could be more, within reason), of
which there are

sum(cnts!=0)
[1] 565

seq[cnts==0] gets me the non-primer sequences (though apparently this
is not entirely satisfactory, see the second entry!)

tables(seq[cnts==0])$top[1:5]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA GATCGGAAGAGCTCGGTATGCCGTCTTCTGCTTAGA 71 12 GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG 5 5 CACCTTTTTCAGTTTTCCTCGCCATATTTCACGTCC 4
sessionInfo() tells me

other attached packages:
[1] ShortRead_0.1.32 lattice_0.17-10 Biostrings_2.9.42 Biobase_2.1.7
Google has not revealed any algorithms or code for this part of the
pipeline.  Does anyone know what algorithms are being used or, even better,
could anyone point me in the direction of some code?

I guess you're asking in general. In terms of ELAND, I'm not really
sure what it does about primers; I think it actually keeps them in,
relying on failure to align to weed them out. With contaminants (which
can be specified in a separate file) ELAND aligns the reads to the
contaminants and discards matches. The code is 'Solexa Open Source' or
some such, and should be available from your friendly Solexa rep; the
contaminant stuff is in Gerald/GERALD.pl of the version of the code I
have access to.

There are some obvious additional problems, e.g., those poly-A
reads. Tools in Biostrings (e.g., alphabetFrequency) can be used in
these cases, too.

All of this seems pretty ad hoc, which is not really what you were
looking for, I suspect.

Martin

Thanks

Krys


Dr Krystyna A Kelly
Senior Research Associate
David Baulcombe Group

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge CB2 3EA
United Kingdom

Tel: +44 (0)1223 333 915
Fax: +44 (0)1223 333953

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to