Hi Stephen -- Some details on PDict are in the Biostrings package, especially
?PDict ?matchPDict ?"matchPDict-inexact" PDict is a 'dictionary' of reads. It allows for a 'trusted band'. The trusted band is a (constant-width) region (e.g., the first 12 nucleotides of each read in the PDict) that the user has special trust in, e.g., as a highly reliable part of the sequence. matchPDict / countPDict process the PDict by matching exactly the trusted band, and allowing mismatches (NOT indels) outside the trusted band. A little more under the hood, a PDict is an implementation of an Aho-Corasick automaton. The entire dictionary is processed in a single pass along the 'subject' (e.g., reference genome); I think the processing time for exact matches is on the order of the length of the subject + the width (not number of entries!) of the trusted band. The current strategy to allow for mismatches is more brute-force, scanning read and subject from exact matches for the trusted band. Variants on the PDict / matchPDict theme are still being developed, e.g., tailored for SNP identification. Again, this is Herve Page's work. Hope that helps, Martin Quoting Stephen Henderson <[EMAIL PROTECTED]>: > Hi > This all soungs very promising Im very impressed by the speed of these > operations. > > Can you give a little more detail about the underlying structure of PDict and > how it 'matches' sequences? > > > ________________________________ > > From: [EMAIL PROTECTED] on behalf of Martin Morgan > Sent: Wed 02/04/2008 04:20 > To: [email protected] > Subject: [Bioc-sig-seq] Bioc short read directions > > > > Short-readers! > > I want to take this opportunity to provide an update on the directions > we've initiated for short reads. It would be great to get your feedback, > and to hear of your own efforts. > > WARNING: This software is in development, and is far from final. In > particular, functions in the BiostringsCinterfaceDemo package are NOT > meant to be final or for use in other packages; they're here to > demonstrate 'proof-of-concept' and to illustrate how users can access > the Biostrings package from C code. I'll indicate which functions are > from BiostringsCinterfaceDemo below. Expect a short-read 'base' package > to materialize in the devel branch of Bioconductor in the not too > distant future. > > WARNING: The data used to illustrate functionality are not meant to be > indiciative of the quality of data produced by Solexa; they are > generally 'first runs' that present a number of interesting challenges > for interpretation. > > HARDWARE AND SOFTWARE: The following include timing and object size > measurements. The machine being used is fast, but we're not doing > anything fancy to, e.g., exploit multiple processors. The machine has a > very large amount of memory; we used about 10 GB below, looking at three > different data sets. The following uses the R-2-7-branch (this is > different from R-devel). The Biostrings and BiostringsCinterfaceDemo > packages are updated very regularly, so be prepared for broken or > outdated functions. > > Herve Pages is responsible for the clever code; I am just a scribe. > > Ok, first a convenience function to print out 'size' in megabytes, > 'cause objects are large! > > > mb <- function(sz) noquote(sprintf("%.1f MB", sz / (1024^2))) > > > * Starters... > > We load the BiostringsCinterfaceDemo, which requires Biostrings. Both of > these need to be from the 'development' branch of Bioconductor. Both are > changing rapidly, and should be obtained from svn and updated regularly > (http://wiki.fhcrc.org/bioc/DeveloperPage, > http://wiki.fhcrc.org/bioc/SvnHowTo). > > > library(BiostringsCinterfaceDemo) > > > * I/O, DNAStringSet, and alphabetFrequency > > We next read in a fasta file derived from a lane of solexa reads. Here's > what the data looks like: > > > readLines(fastaFile, 10) > [1] ">5_1_102_368" > [2] "TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT" > [3] ">5_1_120_254" > [4] "TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT" > [5] ">5_1_110_385" > [6] "GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC" > [7] ">5_1_118_88" > [8] "GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC" > [9] ">5_1_113_327" > [10] "GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC" > > This is a single lane from a Solexa training run; the data are not > filtered, and the run is not meant to be representative in terms of > quality or other characteristics. The DNA used for the reads is from > phage phiX-174. Here's how we read it in (countLines and readSolexaFastA > are in BiostringsCinterfaceDemo) > > > countLines(fastaFile) > s_5.fasta > 18955056 > > system.time({ > + seqa <- readSolexaFastA(fastaFile) > + }, gcFirst=TRUE) > user system elapsed > 67.48 2.08 69.68 > > mb(object.size(seqa)) > [1] 397.7 MB > > seqa > A DNAStringSet instance of length 9477528 > width seq > [1] 36 TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT > [2] 36 TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT > [3] 36 GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC > [4] 36 GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC > [5] 36 GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC > [6] 36 TGACCCTCAGCAATCTTAAACTTCTTAGACGAATCA > [7] 36 GCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGC > [8] 36 TTTAGGTGTCTGTAAAACAGGTGCCGAAGAAGCTGG > [9] 36 GGTCTGTTGAACACGACCAGAAAACTGGCCTAACGA > ... ... ... > [9477520] 36 TACGCAGTTTTGCCGTATACTCGTTGTTCTGACTCT > [9477521] 36 TATACCCCCCCTCCTACTTGTGCTGTTTCTCATGTT > [9477522] 36 CAGGTTGTTTCTGTTGGTGCTGATATTTCTTTTTTT > [9477523] 36 GTCTTCCTTGCTTGTCAGATTGGTCGTCTTATTACC > [9477524] 36 ATACGAAAGACCAGGTATATGCACAAAATGAGTTGC > [9477525] 36 ACCACAAACGCGCTCGTTTATGCTTGCCTCTATTAC > [9477526] 36 ------------------------------------ > [9477527] 36 CCAGCAAGGAAGCCAAGATGGGAAAGGTCATGCGGC > [9477528] 36 CATTGTTGACCACCTACATACCACAGACGAGCACCT > > It takes just over a minute to read in the nearly 9.5 million reads. The > reads are stored efficiently, without the overhead of R character > strings. The data structure (a DNAStringSet, from Biostrings and > therefore stable) will not copy the large data, but instead contains > 'views' into it. > > A basic question is about the nucleotides present in the reads. > alphabetFrequency (from Biostrings) scans all the sequences and tallies > nucleotides > > > system.time({ > + alf1 <- alphabetFrequency(seqa, collapse=TRUE, freq=TRUE) > + }, gcFirst=TRUE) > user system elapsed > 0.612 0.000 0.612 > > alf1 > A C G T M R W S Y K > 0.2449 0.2119 0.2201 0.3030 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 > V H D B N - > 0.0000 0.0000 0.0000 0.0000 0.0000 0.0201 > > (bases are recored with IUPAC symbols, see ?IUPAC_CODE_MAP). This > executes very efficiently. A variant produces a matrix with rows > corresponding to reads and columns to bases. > > > alf <- alphabetFrequency(seqa, baseOnly=TRUE) > > dim(alf) > [1] 9477528 5 > > head(alf) > A C G T other > [1,] 9 4 6 17 0 > [2,] 11 6 5 14 0 > [3,] 10 9 4 13 0 > [4,] 4 9 12 11 0 > [5,] 6 6 11 13 0 > [6,] 12 10 4 10 0 > > This can be remarkably useful. For instance, to select just the 'clean' > sequences (those without ambiguous base calls), one can > > > cleanSeqs <- seqa[alf[,"other"]==0] > > length(seqa) > [1] 9477528 > > length(cleanSeqs) > [1] 9207292 > > This creates a new DNAStringSet with just the clean sequences. It > executes very quickly, because the DNAStringSet is a view into the > original. The memory associated with the reads themselves is not copied. > here is the alphabetFrequency of the 'clean' reads. > > > cleanAlf <- alphabetFrequency(cleanSeqs, baseOnly=TRUE) > > Again this is very useful, for instance the distribution of GC content > among clean reads is > > > plot(density(rowSums(cleanAlf[,c("G", "C")]) / rowSums(cleanAlf))) > > > * PDict, countPDict, matchPDict > > A 'PDict' (defined in Biostrings) is a dictionary-like structure that > can be used for very efficient exact- and partially-exact matching > algorithms. To illustrate, we'll use data from about a million reads of > the Solexa BAC cloning vector. These reads again come from an early run > on the Solexa instrumentation here, and results should not be taken to > be representative of performance. > > We read and clean the sequences as above, resulting in > > > length(cleanSeqs) > [1] 923680 > > We then create a PDict from our DNAStringSet with > > > system.time({ > + pDict <- PDict(cleanSeqs) > + }, gcFirst=TRUE) > user system elapsed > 1.09 0.00 1.10 > > pDict > 923680-pattern constant width PDict object of width 25 (patterns have no > names) > > mb(object.size(pDict)) > [1] 160.4 MB > > This is created quickly. It is a larger object, but the size allows fast > searches. Here we'll use Biostrings readFASTA to read in the sequence to > which the data are to be aligned. > > > bac <- read.DNAStringSet(bacFile, "fasta")[[1]] > Read 2479 items > > length(bac) > [1] 173427 > > This is a BAC clone. We'll match our pDict to the BAC subject, finding > all EXACT matches; > > > system.time({ > + counts <- countPDict(pDict, bac) > + }, gcFirst=FALSE) > user system elapsed > 0.200 0.048 0.268 > > length(counts) > [1] 923680 > > table(counts)[1:5]/sum(table(counts)) > counts > 0 1 2 3 4 > 0.53954 0.42738 0.01136 0.00528 0.00334 > > This is very fast, partly because the subject against which the PDict is > being matched is short. A more realistic use case is to match against a > genome. The integration with BSgenome packages is very smooth. To match > our pDict against human chromosome 6 (where the BAC cloning vector comes > from), we can load the appropriate BSgenome package and chromosome > > > library(BSgenome.Hsapiens.UCSC.hg18) > > > Hsapiens[["chr6"]] > 170899992-letter "DNAString" instance > seq: > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN > > (The N's represent the chromsome telomeres, whose seuqences is not > available). We look for exact matches > > > system.time({ > + hcount <- countPDict(pDict, Hsapiens[["chr6"]]) > + }, gcFirst=TRUE) > user system elapsed > 24.2 0.0 24.2 > > table(hcount)[1:5] / sum(table(hcount)) > hcount > 0 1 2 3 4 > 0.50466 0.32996 0.02043 0.00959 0.00761 > > About 1/2 the sequences exactly match one or more locations on > chromosome 6. Some sequences match many times, though the reason (in > this case, anyway) is not too surprising: > > > max(hcount) > [1] 8286 > > maxIdx = which(hcount==max(hcount)) > > cleanSeqs[maxIdx] > A DNAStringSet instance of length 70 > width seq > [1] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [2] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [3] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [4] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [5] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [6] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [7] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [8] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [9] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > ... ... ... > [62] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [63] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [64] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [65] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [66] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [67] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [68] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [69] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > [70] 25 TTTTTTTTTTTTTTTTTTTTTTTTT > > (that these are identical can be checked with > unique(as.character(cleanSeqs[maxIdx]))) > > The current implementation of PDict allows for a 'trusted band' of > nucleotides that need to match exactly, allowing for approximate matches > in the remaining nucleotides. Here we trust the first 12 bases, and > allow up to 3 mismatches. Also, we use the more informative matchPDict, > which allows us to find the positions of all matches > > > trusted <- PDict(cleanSeqs, tb.end=12) > > system.time({ > + mmMatch <- matchPDict(trusted, Hsapiens[[6]], max.mismatch=3) > + }, gcFirst=TRUE) > user system elapsed > 195.58 3.85 199.44 > > table(cut(countIndex(mmMatch), c(0, 10^(0:5)), right=FALSE)) > > [0,1) [1,10) [10,100) [100,1e+03) [1e+03,1e+04) > 377233 337806 72983 66322 60818 > [1e+04,1e+05) > 8518 > > Execution time increases as the stringency of the match decreases. PDict > facilities do not yet incorporate quality scores, but filtering results > based on quality (qualities are discussed further below) represents a > natural direction for development. > > > > * alphabetByCycle > > alphabetByCycle uses a small C function in BiostringsCinterfaceDemo, > alphabet_by_cycle. It and the read* functions in this package illustrate > how to access DNAStringSet objects at the C level. alphabetByCycle is a > matrix that tallies nucleotide use per cycle > > > system.time({ > + abc <- alphabetByCycle(seqa) > + }) > user system elapsed > 2.68 0.00 2.68 > > Again this can be quite useful. For instance, we can find out the number > of bases that were not called, as a function of cycle > > > abc["-",] > [1] 22181 108829 123173 180382 225091 225055 216787 208538 208881 > [10] 213104 148936 142966 141030 148163 178747 204304 211538 211303 > [19] 213722 211167 208021 208715 165441 158359 147110 151462 204781 > [28] 221008 223922 221171 227622 226936 232232 236606 242387 241718 > > and the number of 'T' nucleotides as a function of cycle > > > abc["T",] / colSums(abc[1:4,]) > [1] 0.286 0.292 0.292 0.284 0.292 0.280 0.293 0.297 0.299 0.287 0.290 > [12] 0.294 0.294 0.299 0.300 0.301 0.304 0.305 0.306 0.309 0.310 0.311 > [23] 0.312 0.316 0.315 0.319 0.320 0.322 0.326 0.328 0.331 0.335 0.339 > [34] 0.342 0.346 0.358 > > That's quite a striking increase after cycle 25! > > > * Qualities > > Solexa reads have quality scores associated with each base call. These > are summarized in files formatted like: > > > readLines(fastqFile, n=8) > [1] "@HWI-EAS88_1_1_1_1001_499" > [2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT" > [3] "+HWI-EAS88_1_1_1_1001_499" > [4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS" > [5] "@HWI-EAS88_1_1_1_898_392" > [6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC" > [7] "+HWI-EAS88_1_1_1_898_392" > [8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK" > > A record consists of four lines. The first and third lines are > identifiers (repeated), the second line is the sequence, the fourth line > an ASCII character representing the score (']' is good, Z is better than > A). Here we read a quality file into a data structure defined in > BiostringsCinterfaceDemo designed to coordinate the sequence, name, and > quality information. > > > system.time({ > + seqq <- readSolexaFastQ(fastqFile) > + }, gcFirst=TRUE) > user system elapsed > 17.533 0.417 17.969 > > mb(object.size(seqq)) > [1] 254.5 MB > > seqq > class: SolexaSequenceQ > length: 2218237 > > > sequences(seqq)[1:5] > A DNAStringSet instance of length 5 > width seq > [1] 36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT > [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC > [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT > [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA > [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT > > names(seqq)[1:5] > A BStringSet instance of length 5 > width seq > [1] 24 HWI-EAS88_1_1_1_1001_499 > [2] 23 HWI-EAS88_1_1_1_898_392 > [3] 23 HWI-EAS88_1_1_1_922_465 > [4] 23 HWI-EAS88_1_1_1_895_493 > [5] 23 HWI-EAS88_1_1_1_953_493 > > scores(seqq)[1:5] > A BStringSet instance of length 5 > width seq > [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS > [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK > [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA > [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA > [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS > > The object returned by readSolexaFastQ is an S4 object with three slots. > Each slot contains a XStringSet, where 'X' is DNA for sequences, and > 'BString' for names and scores. This represents one way of structuring > quality data; the S4 class coordinates subsetting, and provides > (read-only) accessors to the underlying objects. A likely addition to > this class as it matures is the inclusion of lane-specific phenotype > (sample) information, much as an ExpressionSet coordinates sample and > expression values. > > We can gain some basic insight into the sequences as before, e.g., > > > abc <- alphabetByCycle(sequences(seqq)) > > abc["N",] > [1] 0 0 0 0 0 0 0 0 0 0 0 > [12] 0 1213 1631 1155 1240 721 418 8503 526 6493 703 > [23] 14999 718 1623 737 243 40704 811 590 1964 961 809 > [34] 910 477 208 > > Solexa provides files with quality information after filtering reads > based on 'purity', a measure that precludes uncertain bases (IUPAC code > 'N') from a user-specified region (the first 12 cycles, by default). > > > abc["T",] / colSums(abc[1:4,]) > [1] 0.298 0.290 0.292 0.290 0.295 0.279 0.292 0.298 0.301 0.293 0.295 > [12] 0.299 0.296 0.293 0.294 0.294 0.293 0.295 0.294 0.297 0.297 0.299 > [23] 0.298 0.303 0.302 0.307 0.312 0.312 0.321 0.331 0.334 0.351 0.361 > [34] 0.374 0.380 0.423 > > We can also summarize quality information by cycle, using an alphabet > that reflects the encoded scores: > > > alphabet <- sapply(as.raw(32:93), rawToChar) > > abcScore <- alphabetByCycle(scores(seqq), alphabet=alphabet) > > > > rowSums(abcScore) > ! " # $ % & ' > 0 0 0 0 0 0 0 0 > ( ) * + , - . / > 0 0 0 0 0 0 0 0 > 0 1 2 3 4 5 6 7 > 0 0 0 0 0 0 0 0 > 8 9 : ; < = > ? > 0 0 0 0 0 0 0 0 > @ A B C D E F G > 0 1367337 0 2035064 0 1006372 819775 0 > H I J K L M N O > 1926112 116395 1614109 356496 517731 1602298 578172 2016009 > P Q R S T U V W > 3043393 401196 2152160 1553511 2149231 254108 3896558 232409 > X Y Z [ \\ ] > 706452 4353706 1292309 732210 0 45133419 > > abcScore[34:62,c(1:4, 33:36)] > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] > A 1 1788 1798 8 260944 299631 332500 354717 > B 0 0 0 0 0 0 0 0 > C 59 36426 12705 227 39 131268 140461 154643 > D 0 0 0 0 0 0 0 0 > E 133 13391 4033 180 120135 0 0 0 > F 0 0 0 0 0 259549 272238 287988 > G 0 0 0 0 0 0 0 0 > H 272 8815 2933 357 242741 234088 239917 239157 > I 0 0 0 0 116395 0 0 0 > J 472 4678 1656 586 0 195991 198114 187535 > K 5 59 25 11 110052 84345 83744 77353 > L 0 0 0 0 102438 144459 142049 128785 > M 75 171 121 92 93945 119209 116492 103765 > N 653 2992 1232 769 85940 91924 89084 79246 > O 1227 2719 1754 1576 175403 187339 176997 160100 > P 65834 57830 60855 65863 61910 0 0 0 > Q 0 0 0 0 236290 0 0 0 > R 3211 4482 4321 4215 0 0 0 0 > S 0 0 0 0 68378 470434 426641 444948 > T 4444 5373 5868 5637 0 0 0 0 > U 0 0 0 0 0 0 0 0 > V 9146 10270 11922 11547 543627 0 0 0 > W 0 0 0 0 0 0 0 0 > X 0 0 0 0 0 0 0 0 > Y 28363 29080 34093 32873 0 0 0 0 > Z 0 0 0 0 0 0 0 0 > [ 0 0 0 0 0 0 0 0 > \\ 0 0 0 0 0 0 0 0 > ] 2104342 2040163 2074921 2094296 0 0 0 0 > > Output from the last line shows how scores decrease from the first four > cycles to the last four. Standard R and Biostrings commands can be used > to ask many interesting questions, such as an overall quality score of > reads (e.g., summing the scores of individual nucleotides) and the > relationship between sequence characteristics (e.g., frequency of 'T') > and read quality. > > > atgc <- alphabetFrequency(sequences(seqq), baseOnly=TRUE) > > qscore <- alphabetFrequency(scores(seqq)) > > dim(qscore) > [1] 2218237 256 > > mb(object.size(qscore)) > [1] 2166.2 MB > > > quality <- colSums(t(qscore) * 1:ncol(qscore)) > > plot(density(quality)) # small secondary peak at low quality > > scores(seqq)[which(quality<2925)] > A BStringSet instance of length 87696 > width seq > [1] 36 PPPPPPPPPPPPPEPPPPPOPPMOOPPOMMMPOOOJ > [2] 36 PPPPPPPPPPPPPPPPPPPPHPPPOPPMOPPPNKMA > [3] 36 PPPPPPPPPPPPPPPPPOPPPPPPEPMPPPMPOFAF > [4] 36 PPPPPPPPPPPPPPPPPPPPOPPPPPPPPPOPOOHK > [5] 36 PPPPPPPPPPPPPPPPPPPPPPPOPPOPOPPMKMLF > [6] 36 PPPPPPPPPPPPPPPPOPPPPPCPPPPHOOPPOOOO > [7] 36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOKO > [8] 36 PPPPPPPPPPPPPPPPPPPOPPPPPPPPJPOPOMAO > [9] 36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOOL > ... ... ... > [87688] 36 PPPPPPPPMPOPCOOCOOJEOMMCCOEEHCCMAAAA > [87689] 36 PTYOR]NTVVVJOHJCCPMEHCMHOJCECCCCAACA > [87690] 36 Y]]]NTVYYT]TPNRPNRYHCTECOHCHHCECHAAC > [87691] 36 YPRVNVYVPYHORCOCOPEPECJCCHCCJECHAAAA > [87692] 36 PPPPPPPPPPPPPHPPOCPOPPOPCHMMPPPEKHOO > [87693] 36 JJHPTTJYV]]]]JJCJCJJTCCOVCMHMCCOAAAA > [87694] 36 PPPPPPPPPPPPPPPPPOPPPJPPPCPOHHCMOCAF > [87695] 36 TYR]R]TN]YOEPTNERPPTVTCTVCCCRHOMIFAC > [87696] 36 YRRYJVVTPHVPPECHPNCMCCJCEECOCCCCAAAA > > > > t <- atgc[,"T"] / rowSums(atgc[,1:4]) > > cor(t, quality) > [1] 0.154 > > All of these operations are quick enough to perform in an interactive > session; the qscore is a large matrix (it can be made smaller by > choosing bounds that reflect allowable scores, e.g., 32:127), and its > transposition is relatively expensive. > > A final point to remember is that R stores a matrix m as a vector of > length nrow(m) * ncol(m). R has an internal limit on the size of a > vector equal to 2^32-1, so the maximum number of reads whose scores can > be represented by alphabetFrequency is less than 2^32 / 256, or about 16 > million reads; this number of reads might be approached in a single > Solexa lane; a simple solution is to divide the DNAStringSet into pieces > that are processed separately. > > > > I hope that the forgoing provides some indication of where we stand at > the moment. Again, it would be great to have feedback, and to hear of > other efforts. And again, the programming credit goes to Herve Pages. > > Martin > > The obligatory sessionInfo, plus some stats on processing this document > (referencing, in the end, three different data sets). > > > sessionInfo() > R version 2.7.0 alpha (2008-03-28 r44972) > x86_64-unknown-linux-gnu > > locale: > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C > > attached base packages: > [1] tools stats graphics grDevices utils datasets > [7] methods base > > other attached packages: > [1] BiostringsCinterfaceDemo_0.1.2 > [2] BSgenome.Hsapiens.UCSC.hg18_1.3.2 > [3] BSgenome_1.7.4 > [4] Biostrings_2.7.41 > [5] Biobase_1.99.4 > > gc() > used (Mb) gc trigger (Mb) max used (Mb) > Ncells 9.37e+05 50.1 4.87e+06 260 1.93e+07 1033 > Vcells 6.88e+08 5252.6 1.31e+09 10031 1.31e+09 9998 > > proc.time() > user system elapsed > 356.7 16.6 378.5 > > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > > > ********************************************************************** > This email and any files transmitted with it are confi...{{dropped:12}} _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
