Hi This all soungs very promising Im very impressed by the speed of these operations. Can you give a little more detail about the underlying structure of PDict and how it 'matches' sequences?
________________________________ From: [EMAIL PROTECTED] on behalf of Martin Morgan Sent: Wed 02/04/2008 04:20 To: [email protected] Subject: [Bioc-sig-seq] Bioc short read directions Short-readers! I want to take this opportunity to provide an update on the directions we've initiated for short reads. It would be great to get your feedback, and to hear of your own efforts. WARNING: This software is in development, and is far from final. In particular, functions in the BiostringsCinterfaceDemo package are NOT meant to be final or for use in other packages; they're here to demonstrate 'proof-of-concept' and to illustrate how users can access the Biostrings package from C code. I'll indicate which functions are from BiostringsCinterfaceDemo below. Expect a short-read 'base' package to materialize in the devel branch of Bioconductor in the not too distant future. WARNING: The data used to illustrate functionality are not meant to be indiciative of the quality of data produced by Solexa; they are generally 'first runs' that present a number of interesting challenges for interpretation. HARDWARE AND SOFTWARE: The following include timing and object size measurements. The machine being used is fast, but we're not doing anything fancy to, e.g., exploit multiple processors. The machine has a very large amount of memory; we used about 10 GB below, looking at three different data sets. The following uses the R-2-7-branch (this is different from R-devel). The Biostrings and BiostringsCinterfaceDemo packages are updated very regularly, so be prepared for broken or outdated functions. Herve Pages is responsible for the clever code; I am just a scribe. Ok, first a convenience function to print out 'size' in megabytes, 'cause objects are large! > mb <- function(sz) noquote(sprintf("%.1f MB", sz / (1024^2))) * Starters... We load the BiostringsCinterfaceDemo, which requires Biostrings. Both of these need to be from the 'development' branch of Bioconductor. Both are changing rapidly, and should be obtained from svn and updated regularly (http://wiki.fhcrc.org/bioc/DeveloperPage, http://wiki.fhcrc.org/bioc/SvnHowTo). > library(BiostringsCinterfaceDemo) * I/O, DNAStringSet, and alphabetFrequency We next read in a fasta file derived from a lane of solexa reads. Here's what the data looks like: > readLines(fastaFile, 10) [1] ">5_1_102_368" [2] "TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT" [3] ">5_1_120_254" [4] "TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT" [5] ">5_1_110_385" [6] "GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC" [7] ">5_1_118_88" [8] "GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC" [9] ">5_1_113_327" [10] "GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC" This is a single lane from a Solexa training run; the data are not filtered, and the run is not meant to be representative in terms of quality or other characteristics. The DNA used for the reads is from phage phiX-174. Here's how we read it in (countLines and readSolexaFastA are in BiostringsCinterfaceDemo) > countLines(fastaFile) s_5.fasta 18955056 > system.time({ + seqa <- readSolexaFastA(fastaFile) + }, gcFirst=TRUE) user system elapsed 67.48 2.08 69.68 > mb(object.size(seqa)) [1] 397.7 MB > seqa A DNAStringSet instance of length 9477528 width seq [1] 36 TAAGAGGTTTAAATTTTCTTCAGGTCAGTATTCTTT [2] 36 TTAATTCGTAAACAAGCAGTAGTAATTCCTGCTTTT [3] 36 GCTAATTTGCCTACTAACCAAGAACTTGATTTCTTC [4] 36 GTTTGGAGTGATACTGACCGCTCTCGTGGTCGTCGC [5] 36 GCTTGCGTTTATGGTACGCTGGACTTTGTAGGATAC [6] 36 TGACCCTCAGCAATCTTAAACTTCTTAGACGAATCA [7] 36 GCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGC [8] 36 TTTAGGTGTCTGTAAAACAGGTGCCGAAGAAGCTGG [9] 36 GGTCTGTTGAACACGACCAGAAAACTGGCCTAACGA ... ... ... [9477520] 36 TACGCAGTTTTGCCGTATACTCGTTGTTCTGACTCT [9477521] 36 TATACCCCCCCTCCTACTTGTGCTGTTTCTCATGTT [9477522] 36 CAGGTTGTTTCTGTTGGTGCTGATATTTCTTTTTTT [9477523] 36 GTCTTCCTTGCTTGTCAGATTGGTCGTCTTATTACC [9477524] 36 ATACGAAAGACCAGGTATATGCACAAAATGAGTTGC [9477525] 36 ACCACAAACGCGCTCGTTTATGCTTGCCTCTATTAC [9477526] 36 ------------------------------------ [9477527] 36 CCAGCAAGGAAGCCAAGATGGGAAAGGTCATGCGGC [9477528] 36 CATTGTTGACCACCTACATACCACAGACGAGCACCT It takes just over a minute to read in the nearly 9.5 million reads. The reads are stored efficiently, without the overhead of R character strings. The data structure (a DNAStringSet, from Biostrings and therefore stable) will not copy the large data, but instead contains 'views' into it. A basic question is about the nucleotides present in the reads. alphabetFrequency (from Biostrings) scans all the sequences and tallies nucleotides > system.time({ + alf1 <- alphabetFrequency(seqa, collapse=TRUE, freq=TRUE) + }, gcFirst=TRUE) user system elapsed 0.612 0.000 0.612 > alf1 A C G T M R W S Y K 0.2449 0.2119 0.2201 0.3030 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 V H D B N - 0.0000 0.0000 0.0000 0.0000 0.0000 0.0201 (bases are recored with IUPAC symbols, see ?IUPAC_CODE_MAP). This executes very efficiently. A variant produces a matrix with rows corresponding to reads and columns to bases. > alf <- alphabetFrequency(seqa, baseOnly=TRUE) > dim(alf) [1] 9477528 5 > head(alf) A C G T other [1,] 9 4 6 17 0 [2,] 11 6 5 14 0 [3,] 10 9 4 13 0 [4,] 4 9 12 11 0 [5,] 6 6 11 13 0 [6,] 12 10 4 10 0 This can be remarkably useful. For instance, to select just the 'clean' sequences (those without ambiguous base calls), one can > cleanSeqs <- seqa[alf[,"other"]==0] > length(seqa) [1] 9477528 > length(cleanSeqs) [1] 9207292 This creates a new DNAStringSet with just the clean sequences. It executes very quickly, because the DNAStringSet is a view into the original. The memory associated with the reads themselves is not copied. here is the alphabetFrequency of the 'clean' reads. > cleanAlf <- alphabetFrequency(cleanSeqs, baseOnly=TRUE) Again this is very useful, for instance the distribution of GC content among clean reads is > plot(density(rowSums(cleanAlf[,c("G", "C")]) / rowSums(cleanAlf))) * PDict, countPDict, matchPDict A 'PDict' (defined in Biostrings) is a dictionary-like structure that can be used for very efficient exact- and partially-exact matching algorithms. To illustrate, we'll use data from about a million reads of the Solexa BAC cloning vector. These reads again come from an early run on the Solexa instrumentation here, and results should not be taken to be representative of performance. We read and clean the sequences as above, resulting in > length(cleanSeqs) [1] 923680 We then create a PDict from our DNAStringSet with > system.time({ + pDict <- PDict(cleanSeqs) + }, gcFirst=TRUE) user system elapsed 1.09 0.00 1.10 > pDict 923680-pattern constant width PDict object of width 25 (patterns have no names) > mb(object.size(pDict)) [1] 160.4 MB This is created quickly. It is a larger object, but the size allows fast searches. Here we'll use Biostrings readFASTA to read in the sequence to which the data are to be aligned. > bac <- read.DNAStringSet(bacFile, "fasta")[[1]] Read 2479 items > length(bac) [1] 173427 This is a BAC clone. We'll match our pDict to the BAC subject, finding all EXACT matches; > system.time({ + counts <- countPDict(pDict, bac) + }, gcFirst=FALSE) user system elapsed 0.200 0.048 0.268 > length(counts) [1] 923680 > table(counts)[1:5]/sum(table(counts)) counts 0 1 2 3 4 0.53954 0.42738 0.01136 0.00528 0.00334 This is very fast, partly because the subject against which the PDict is being matched is short. A more realistic use case is to match against a genome. The integration with BSgenome packages is very smooth. To match our pDict against human chromosome 6 (where the BAC cloning vector comes from), we can load the appropriate BSgenome package and chromosome > library(BSgenome.Hsapiens.UCSC.hg18) > Hsapiens[["chr6"]] 170899992-letter "DNAString" instance seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (The N's represent the chromsome telomeres, whose seuqences is not available). We look for exact matches > system.time({ + hcount <- countPDict(pDict, Hsapiens[["chr6"]]) + }, gcFirst=TRUE) user system elapsed 24.2 0.0 24.2 > table(hcount)[1:5] / sum(table(hcount)) hcount 0 1 2 3 4 0.50466 0.32996 0.02043 0.00959 0.00761 About 1/2 the sequences exactly match one or more locations on chromosome 6. Some sequences match many times, though the reason (in this case, anyway) is not too surprising: > max(hcount) [1] 8286 > maxIdx = which(hcount==max(hcount)) > cleanSeqs[maxIdx] A DNAStringSet instance of length 70 width seq [1] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [2] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [3] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [4] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [5] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [6] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [7] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [8] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [9] 25 TTTTTTTTTTTTTTTTTTTTTTTTT ... ... ... [62] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [63] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [64] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [65] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [66] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [67] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [68] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [69] 25 TTTTTTTTTTTTTTTTTTTTTTTTT [70] 25 TTTTTTTTTTTTTTTTTTTTTTTTT (that these are identical can be checked with unique(as.character(cleanSeqs[maxIdx]))) The current implementation of PDict allows for a 'trusted band' of nucleotides that need to match exactly, allowing for approximate matches in the remaining nucleotides. Here we trust the first 12 bases, and allow up to 3 mismatches. Also, we use the more informative matchPDict, which allows us to find the positions of all matches > trusted <- PDict(cleanSeqs, tb.end=12) > system.time({ + mmMatch <- matchPDict(trusted, Hsapiens[[6]], max.mismatch=3) + }, gcFirst=TRUE) user system elapsed 195.58 3.85 199.44 > table(cut(countIndex(mmMatch), c(0, 10^(0:5)), right=FALSE)) [0,1) [1,10) [10,100) [100,1e+03) [1e+03,1e+04) 377233 337806 72983 66322 60818 [1e+04,1e+05) 8518 Execution time increases as the stringency of the match decreases. PDict facilities do not yet incorporate quality scores, but filtering results based on quality (qualities are discussed further below) represents a natural direction for development. * alphabetByCycle alphabetByCycle uses a small C function in BiostringsCinterfaceDemo, alphabet_by_cycle. It and the read* functions in this package illustrate how to access DNAStringSet objects at the C level. alphabetByCycle is a matrix that tallies nucleotide use per cycle > system.time({ + abc <- alphabetByCycle(seqa) + }) user system elapsed 2.68 0.00 2.68 Again this can be quite useful. For instance, we can find out the number of bases that were not called, as a function of cycle > abc["-",] [1] 22181 108829 123173 180382 225091 225055 216787 208538 208881 [10] 213104 148936 142966 141030 148163 178747 204304 211538 211303 [19] 213722 211167 208021 208715 165441 158359 147110 151462 204781 [28] 221008 223922 221171 227622 226936 232232 236606 242387 241718 and the number of 'T' nucleotides as a function of cycle > abc["T",] / colSums(abc[1:4,]) [1] 0.286 0.292 0.292 0.284 0.292 0.280 0.293 0.297 0.299 0.287 0.290 [12] 0.294 0.294 0.299 0.300 0.301 0.304 0.305 0.306 0.309 0.310 0.311 [23] 0.312 0.316 0.315 0.319 0.320 0.322 0.326 0.328 0.331 0.335 0.339 [34] 0.342 0.346 0.358 That's quite a striking increase after cycle 25! * Qualities Solexa reads have quality scores associated with each base call. These are summarized in files formatted like: > readLines(fastqFile, n=8) [1] "@HWI-EAS88_1_1_1_1001_499" [2] "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT" [3] "+HWI-EAS88_1_1_1_1001_499" [4] "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS" [5] "@HWI-EAS88_1_1_1_898_392" [6] "GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC" [7] "+HWI-EAS88_1_1_1_898_392" [8] "]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK" A record consists of four lines. The first and third lines are identifiers (repeated), the second line is the sequence, the fourth line an ASCII character representing the score (']' is good, Z is better than A). Here we read a quality file into a data structure defined in BiostringsCinterfaceDemo designed to coordinate the sequence, name, and quality information. > system.time({ + seqq <- readSolexaFastQ(fastqFile) + }, gcFirst=TRUE) user system elapsed 17.533 0.417 17.969 > mb(object.size(seqq)) [1] 254.5 MB > seqq class: SolexaSequenceQ length: 2218237 > sequences(seqq)[1:5] A DNAStringSet instance of length 5 width seq [1] 36 GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT [2] 36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC [3] 36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT [4] 36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA [5] 36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT > names(seqq)[1:5] A BStringSet instance of length 5 width seq [1] 24 HWI-EAS88_1_1_1_1001_499 [2] 23 HWI-EAS88_1_1_1_898_392 [3] 23 HWI-EAS88_1_1_1_922_465 [4] 23 HWI-EAS88_1_1_1_895_493 [5] 23 HWI-EAS88_1_1_1_953_493 > scores(seqq)[1:5] A BStringSet instance of length 5 width seq [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS The object returned by readSolexaFastQ is an S4 object with three slots. Each slot contains a XStringSet, where 'X' is DNA for sequences, and 'BString' for names and scores. This represents one way of structuring quality data; the S4 class coordinates subsetting, and provides (read-only) accessors to the underlying objects. A likely addition to this class as it matures is the inclusion of lane-specific phenotype (sample) information, much as an ExpressionSet coordinates sample and expression values. We can gain some basic insight into the sequences as before, e.g., > abc <- alphabetByCycle(sequences(seqq)) > abc["N",] [1] 0 0 0 0 0 0 0 0 0 0 0 [12] 0 1213 1631 1155 1240 721 418 8503 526 6493 703 [23] 14999 718 1623 737 243 40704 811 590 1964 961 809 [34] 910 477 208 Solexa provides files with quality information after filtering reads based on 'purity', a measure that precludes uncertain bases (IUPAC code 'N') from a user-specified region (the first 12 cycles, by default). > abc["T",] / colSums(abc[1:4,]) [1] 0.298 0.290 0.292 0.290 0.295 0.279 0.292 0.298 0.301 0.293 0.295 [12] 0.299 0.296 0.293 0.294 0.294 0.293 0.295 0.294 0.297 0.297 0.299 [23] 0.298 0.303 0.302 0.307 0.312 0.312 0.321 0.331 0.334 0.351 0.361 [34] 0.374 0.380 0.423 We can also summarize quality information by cycle, using an alphabet that reflects the encoded scores: > alphabet <- sapply(as.raw(32:93), rawToChar) > abcScore <- alphabetByCycle(scores(seqq), alphabet=alphabet) > > rowSums(abcScore) ! " # $ % & ' 0 0 0 0 0 0 0 0 ( ) * + , - . / 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 8 9 : ; < = > ? 0 0 0 0 0 0 0 0 @ A B C D E F G 0 1367337 0 2035064 0 1006372 819775 0 H I J K L M N O 1926112 116395 1614109 356496 517731 1602298 578172 2016009 P Q R S T U V W 3043393 401196 2152160 1553511 2149231 254108 3896558 232409 X Y Z [ \\ ] 706452 4353706 1292309 732210 0 45133419 > abcScore[34:62,c(1:4, 33:36)] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] A 1 1788 1798 8 260944 299631 332500 354717 B 0 0 0 0 0 0 0 0 C 59 36426 12705 227 39 131268 140461 154643 D 0 0 0 0 0 0 0 0 E 133 13391 4033 180 120135 0 0 0 F 0 0 0 0 0 259549 272238 287988 G 0 0 0 0 0 0 0 0 H 272 8815 2933 357 242741 234088 239917 239157 I 0 0 0 0 116395 0 0 0 J 472 4678 1656 586 0 195991 198114 187535 K 5 59 25 11 110052 84345 83744 77353 L 0 0 0 0 102438 144459 142049 128785 M 75 171 121 92 93945 119209 116492 103765 N 653 2992 1232 769 85940 91924 89084 79246 O 1227 2719 1754 1576 175403 187339 176997 160100 P 65834 57830 60855 65863 61910 0 0 0 Q 0 0 0 0 236290 0 0 0 R 3211 4482 4321 4215 0 0 0 0 S 0 0 0 0 68378 470434 426641 444948 T 4444 5373 5868 5637 0 0 0 0 U 0 0 0 0 0 0 0 0 V 9146 10270 11922 11547 543627 0 0 0 W 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 Y 28363 29080 34093 32873 0 0 0 0 Z 0 0 0 0 0 0 0 0 [ 0 0 0 0 0 0 0 0 \\ 0 0 0 0 0 0 0 0 ] 2104342 2040163 2074921 2094296 0 0 0 0 Output from the last line shows how scores decrease from the first four cycles to the last four. Standard R and Biostrings commands can be used to ask many interesting questions, such as an overall quality score of reads (e.g., summing the scores of individual nucleotides) and the relationship between sequence characteristics (e.g., frequency of 'T') and read quality. > atgc <- alphabetFrequency(sequences(seqq), baseOnly=TRUE) > qscore <- alphabetFrequency(scores(seqq)) > dim(qscore) [1] 2218237 256 > mb(object.size(qscore)) [1] 2166.2 MB > quality <- colSums(t(qscore) * 1:ncol(qscore)) > plot(density(quality)) # small secondary peak at low quality > scores(seqq)[which(quality<2925)] A BStringSet instance of length 87696 width seq [1] 36 PPPPPPPPPPPPPEPPPPPOPPMOOPPOMMMPOOOJ [2] 36 PPPPPPPPPPPPPPPPPPPPHPPPOPPMOPPPNKMA [3] 36 PPPPPPPPPPPPPPPPPOPPPPPPEPMPPPMPOFAF [4] 36 PPPPPPPPPPPPPPPPPPPPOPPPPPPPPPOPOOHK [5] 36 PPPPPPPPPPPPPPPPPPPPPPPOPPOPOPPMKMLF [6] 36 PPPPPPPPPPPPPPPPOPPPPPCPPPPHOOPPOOOO [7] 36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOKO [8] 36 PPPPPPPPPPPPPPPPPPPOPPPPPPPPJPOPOMAO [9] 36 PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPOOOL ... ... ... [87688] 36 PPPPPPPPMPOPCOOCOOJEOMMCCOEEHCCMAAAA [87689] 36 PTYOR]NTVVVJOHJCCPMEHCMHOJCECCCCAACA [87690] 36 Y]]]NTVYYT]TPNRPNRYHCTECOHCHHCECHAAC [87691] 36 YPRVNVYVPYHORCOCOPEPECJCCHCCJECHAAAA [87692] 36 PPPPPPPPPPPPPHPPOCPOPPOPCHMMPPPEKHOO [87693] 36 JJHPTTJYV]]]]JJCJCJJTCCOVCMHMCCOAAAA [87694] 36 PPPPPPPPPPPPPPPPPOPPPJPPPCPOHHCMOCAF [87695] 36 TYR]R]TN]YOEPTNERPPTVTCTVCCCRHOMIFAC [87696] 36 YRRYJVVTPHVPPECHPNCMCCJCEECOCCCCAAAA > > t <- atgc[,"T"] / rowSums(atgc[,1:4]) > cor(t, quality) [1] 0.154 All of these operations are quick enough to perform in an interactive session; the qscore is a large matrix (it can be made smaller by choosing bounds that reflect allowable scores, e.g., 32:127), and its transposition is relatively expensive. A final point to remember is that R stores a matrix m as a vector of length nrow(m) * ncol(m). R has an internal limit on the size of a vector equal to 2^32-1, so the maximum number of reads whose scores can be represented by alphabetFrequency is less than 2^32 / 256, or about 16 million reads; this number of reads might be approached in a single Solexa lane; a simple solution is to divide the DNAStringSet into pieces that are processed separately. I hope that the forgoing provides some indication of where we stand at the moment. Again, it would be great to have feedback, and to hear of other efforts. And again, the programming credit goes to Herve Pages. Martin The obligatory sessionInfo, plus some stats on processing this document (referencing, in the end, three different data sets). > sessionInfo() R version 2.7.0 alpha (2008-03-28 r44972) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] tools stats graphics grDevices utils datasets [7] methods base other attached packages: [1] BiostringsCinterfaceDemo_0.1.2 [2] BSgenome.Hsapiens.UCSC.hg18_1.3.2 [3] BSgenome_1.7.4 [4] Biostrings_2.7.41 [5] Biobase_1.99.4 > gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 9.37e+05 50.1 4.87e+06 260 1.93e+07 1033 Vcells 6.88e+08 5252.6 1.31e+09 10031 1.31e+09 9998 > proc.time() user system elapsed 356.7 16.6 378.5 _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing ********************************************************************** This email and any files transmitted with it are confide...{{dropped:11}} _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
