I have a question about using Picard MarkDuplicates with single-end reads. I couldn't find how Picard defines duplicate reads in the manual, but in various other places I have read that duplicate reads (for single-end reads at least) are any two reads with the same start position and CIGAR string (although for paired-end reads, I gather it is based on the 5' positions of both reads in the pair).
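To make the two candidate definitions concrete, here is a minimal sketch in plain Python. It is not Picard's code: the `Read` tuple, `five_prime` and `group_duplicates` are hypothetical names of mine, and taking the 5' end as the clip-adjusted coordinate on the read's strand is my assumption about what "start position" should mean.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical minimal record, not a pysam or htsjdk type.
    name: str
    ref: str
    pos: int        # 0-based leftmost aligned reference position
    reverse: bool   # True if the read maps to the reverse strand
    cigar: str


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def five_prime(read):
    """Clip-adjusted 5' coordinate of the read (assumption: S and H both count)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(read.cigar)]
    if read.reverse:
        # On the reverse strand the 5' end is the rightmost reference position.
        ops_from_5p = list(reversed(ops))
    else:
        ops_from_5p = ops
    clipped = 0
    for n, op in ops_from_5p:
        if op not in "SH":
            break
        clipped += n
    if read.reverse:
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        return read.pos + ref_len - 1 + clipped
    return read.pos - clipped


def group_duplicates(reads, use_cigar):
    """Return lists of read names that share a duplicate key."""
    groups = defaultdict(list)
    for r in reads:
        key = (r.ref, r.reverse, five_prime(r))
        if use_cigar:
            key += (r.cigar,)  # stricter definition: CIGAR must match too
        groups[key].append(r.name)
    return [names for names in groups.values() if len(names) > 1]


reads = [
    Read("r1", "chr1", 100, False, "100M"),
    Read("r2", "chr1", 100, False, "20M"),          # short read, same start
    Read("r3", "chr1", 100, False, "50M1000N50M"),  # spliced read, same start
    Read("r4", "chr1", 100, False, "100M"),
]

print(group_duplicates(reads, use_cigar=False))  # [['r1', 'r2', 'r3', 'r4']]
print(group_duplicates(reads, use_cigar=True))   # [['r1', 'r4']]
```

With the position-only key all four reads fall into one group; adding the CIGAR string to the key leaves only the two identical 100M reads grouped together.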
However, in my results (single-end sequencing), that definition does not appear to hold: all reads with the same start position are collapsed regardless of mapped end position, read length or CIGAR string, leaving only one read. In many cases this seems like a poor choice, because it collapses reads that are very unlikely to be PCR duplicates: spliced reads with unspliced ones, 100 bp reads with 20 bp ones. Is this the expected behaviour for single-end reads, i.e. identifying duplicates based solely on the 5' mapping location? And if it is, and I want to collapse based on 5' location and CIGAR string (or something similar), does anyone know of an existing tool that does this, or should I just write one myself?

Thanks!
Kat

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help