Hi Loyal,

Loyal Goff wrote:
> This is a great start...thanks to both Martin and Herve. The speed is
> indeed impressive! I do have one question. Would it be advantageous
> to reduce the data to a unique list of read sequences, and in doing so
> both retain counts in a separate slot and reduce the matrix size? It
> seems to me this would speed everything along as well. (ie. only
> attempt to align a unique sequence once).

PDict()/matchPDict() do this already. A PDict object has a @dups slot for
storing the duplicate information. When the reads are preprocessed with
PDict(), only unique reads are stored in the Aho-Corasick tree (@actree
slot), and, for each duplicated read, a pointer to the first read that it
duplicates is stored in the @dups slot.

Then, when the PDict object is passed to matchPDict() (or countPDict()),
the matches are searched only for the unique reads first, and then the
@dups slot is used to also report the matches (or match count) for the
duplicated reads. All this is transparent to the user.

Cheers,
H.

> Does anyone have a need to
> retain independent reads after a quality score cutoff?
>
> Loyal
>
> Loyal A. Goff
>

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
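
The deduplication scheme described in the reply above can be illustrated
with a small Python sketch. It is not the Biostrings implementation: the
function names (build_dict, find_all, match_reads) are hypothetical, and
a plain str.find() scan stands in for the Aho-Corasick tree. It only
shows the bookkeeping: search each unique read once, then use a
"dups"-style table to copy the results to the duplicated reads.

```python
def build_dict(reads):
    """Split reads into unique ones and a duplicate table.

    Returns (unique, dups) where `unique` lists the indices of first
    occurrences and dups[i] is the index of the first read identical
    to reads[i], or None if reads[i] is itself a first occurrence.
    (Hypothetical helper, loosely mirroring the role of a dups slot.)
    """
    first_seen = {}
    unique, dups = [], []
    for i, read in enumerate(reads):
        if read in first_seen:
            dups.append(first_seen[read])
        else:
            first_seen[read] = i
            unique.append(i)
            dups.append(None)
    return unique, dups


def find_all(pattern, subject):
    """0-based start positions of all (possibly overlapping) exact hits.

    Stand-in for the real search; a naive scan, not Aho-Corasick.
    """
    hits = []
    pos = subject.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = subject.find(pattern, pos + 1)
    return hits


def match_reads(reads, subject):
    """Match every read, but search each distinct sequence only once."""
    unique, dups = build_dict(reads)
    matches = [None] * len(reads)
    for i in unique:                      # search unique reads first
        matches[i] = find_all(reads[i], subject)
    for i, first in enumerate(dups):      # then fill in the duplicates
        if first is not None:
            matches[i] = matches[first]
    return matches


if __name__ == "__main__":
    reads = ["ACG", "TTT", "ACG"]
    print(match_reads(reads, "ACGTACGTTT"))
```

The caller still gets one result per input read, in input order, so the
deduplication stays invisible from the outside.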
