I have a question about using Picard MarkDuplicates with single-end reads.

I couldn't find information on how Picard defines duplicate reads in the 
manual, but in various other places I have read that duplicate reads (for 
single-end reads at least) are any two reads with the same start position and 
CIGAR string (although for paired-end reads, I gathered it is by the positions 
of the 5' ends of each read pair).

However, in my results (single-end sequencing), this does not appear to be 
correct: all reads with the same start position are collapsed regardless of 
mapped end position, read length or CIGAR string, leaving only one read. In 
many cases this seems like a poor choice, since it collapses reads that are 
very unlikely to be PCR duplicates: spliced reads with unspliced ones, 100 bp 
reads with 20 bp ones.

I was wondering whether this is expected behaviour for single-end reads: 
identifying duplicates based solely on 5' mapping location? Also, if it is 
expected, and I want to collapse based on 5' location and CIGAR string (or 
something similar), does anyone know of an existing tool that does this, or 
should I just write one myself?
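
To make concrete what I mean by collapsing on 5' location and CIGAR string, 
here is a rough sketch in plain Python that works on SAM text lines. This is 
not how Picard does it; the function names (five_prime_pos, find_duplicates) 
are made up for illustration, and keeping the first read seen in each group is 
an arbitrary choice on my part. It also ignores soft clipping when computing 
the 5' position, which a real tool would probably want to account for.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def five_prime_pos(pos, cigar, reverse):
    """Return the 1-based reference position of the read's 5' end.

    Forward-strand reads: the leftmost mapped position (POS).
    Reverse-strand reads: the rightmost mapped position.
    Soft clips are ignored here (a simplifying assumption).
    """
    if not reverse:
        return pos
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


def find_duplicates(sam_lines):
    """Return the names of reads that share chromosome, strand, 5' position
    and CIGAR string with an earlier read. The first read in each group is kept."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        name, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4:
            continue  # unmapped read
        reverse = bool(flag & 0x10)
        key = (rname, reverse, five_prime_pos(pos, cigar, reverse), cigar)
        groups[key].append(name)

    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # everything after the first read seen
    return duplicates
```

With this grouping, a spliced read (e.g. 20M30N30M) and an unspliced read 
(50M) starting at the same position end up in different groups, so neither 
would be marked as a duplicate of the other.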

Thanks!

Kat


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
