On 01/27/2010 05:34 PM, joseph wrote: > Thanks for the clarification. > How do I filter the reads that occur exactly once?
Hi Joseph -- One way is to rnk = srrank(sread(rfq)) identical reads are assigned the same srrank, so rnk %in% which(tabulate(rnk)==1) returns those reads whose rank occurs exactly once and rfq1 = rfq[rnk %in% which(tabulate(rnk)==1)] subsets the rfq object. We should check this with any(srduplicated(sread(rfq1))) which should return FALSE. Maybe worth pointing out that the reads might have different quality base scores, and then it's not clear that they are the 'same'. Also in aligned reads that they might not have been aligned to the same chr/strand/position even when the sequence (and base quality!, yes, some aligners chose amongst equally likely alignment positions by tossing a coin) is the same. Martin > > > > ________________________________ > From: Martin Morgan <[email protected]> > To: joseph <[email protected]> > Cc: [email protected] > Sent: Wed, January 27, 2010 5:29:00 PM > Subject: Re: [Bioc-sig-seq] unique reads count > > On 01/27/2010 02:57 PM, joseph wrote: >> If I understand this correctly, !duplicated does not count unique reads: >> !duplicated(c("A", "B", "B", "C")) >> [1] TRUE TRUE FALSE TRUE >> >> in your example, there are only two uniques (not 3 as counted by >> !duplicated). > > Well, it depends on what your definition of 'duplicate' is, but yes, I > think we agree. > >> Is it correct to say that the number of unique reads is given by the nReads >> when nOccurrences=1 using the tables function? > > again unique needs definition (see unique(c("A", "B", "B", "C")), for > instance!) but yes, nOccurrences is the number of reads that occur > exactly once. > > Martin > >> >> >> ________________________________ >> From: Martin Morgan <[email protected]> >> To: joseph <[email protected]> >> Cc: [email protected] >> Sent: Wed, January 27, 2010 1:06:55 PM >> Subject: Re: [Bioc-sig-seq] unique reads count >> >> Hi Joseph -- >> >> On 01/27/2010 12:33 PM, joseph wrote: >>> Hello >>> I have a ShortReadQ object: >>>> rfq >>> class: ShortReadQ >>> length: 16115723 reads; width: 34 cycles >>> >>> I used the negation of the result from srduplicated to count the unique >>> reads: >>>> sum(!srduplicated(sread(rfq))) >>> [1] 4545719 >>> >>> But also I looked at the frequency with which each read occurs using the >>> tables function: >>>> head(tables(rfq_s_3_mel)$distribution) >>> nOccurrences nReads >>> 1 1 4022038 >>> 2 2 255649 >> >> srduplicated is behaving like 'duplicated', which is to return TRUE when >> an element has already been seen >> >>> duplicated(c("A", "B", "B", "C")) >> [1] FALSE FALSE TRUE FALSE >> >> There's one duplicate, the second 'B'. >> >> After example(srduplicated) I have >> >>> tables(sread(rfq))$distribution >> nOccurrences nReads >> 1 1 239 >> 2 2 7 >> 3 3 1 >>> sum(srduplicated(sread(rfq))) >> [1] 9 >> >> there are 7 reads that are the second of two reads, and 2 reads that are >> the second and third of three reads. >> >> Martin >> >>> >>> I expected that for nOccurrences=1, the nReads should be the same as what I >>> got with !srduplicated. >>> >>> Can anybody explain why I got different counts? >>> Thank you >>> Joseph Dhahbi >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioc-sig-sequencing mailing list >>> [email protected] >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
