Hey Katie,

it looks like you have at least one small test case with which we could
debug.  Could you provide it in a SAM or BAM format, header included, and
re-specify what is wrong and what you expect to help us debug?

Picard examines reads for duplicates by using the 5' unclipped position of
each read.  For those that have the same 5' unclipped position, they are
treated as duplicates.  For pairs, both ends of the pair need to have the
same 5' unclipped position.  Pairs are also preferred over unpaired reads,
meaning if a single ended read has the same 5' unclipped position as one
end of a pair, then the single ended read is marked as a duplicate.

The 5' position is relative to the original direction of sequencing, not
genomic 5', so if the read maps to the reverse strand the start of the read
from MarkDuplicate's perspective will be on the 3' genomic end.

N

On Tue, Feb 3, 2015 at 4:37 PM, Katie Faust Stryjewski <kati...@bu.edu>
wrote:

> Hello,
>
> We are having some issues with Picard's MarkDuplicates function, and we
> were hoping someone could shed some light on how exactly this algorithm
> works and why it might be giving us these results.
>
> To give a quick overview of the data we're working with: we have samples
> from several closely-related species of finches. Our libraries were
> generated using Illumina's Nextera kit. We removed adapters using CutAdapt,
> merged overlapping reads using PEAR, aligned the sequences to a reference
> genome (Zebra Finch) using bowtie2, converted the resulting .sam files to
> .bam files using Samtools, and finally sorted the reads in coordinate order
> using Picard. Then we ran MarkDuplicates.
>
> Looking at the results, we noticed that a much higher proportion of our
> unpaired reads (~15-60%; these are the reads that were merged using PEAR)
> are being marked as duplicates than our paired reads (1-3%). When we
> examine the reads marked as duplicates, we see two main problems. The first
> that reads that were soft clipped to the same starting site during local
> alignment are being marked as duplicates. Our species are at least several
> million years divergent from the reference genome, and bowtie is clipping
> the ends of fragments where the alignments don't match well. However, we
> were under the impression that Picard was able to take this into account.
>
> The second issue is that we also see reads being marked as duplicates with
> no explanation that we can determine. One example is below. The unpaired
> read marked as a duplicate sits between the first reads of two different
> paired sequences. None of these reads were trimmed during alignment. The
> lines from the .bam file for this duplicate and the two sets of paired
> reads are below.
>
> Out of 560,512 reads marked as duplicates from one sample, only 126,256
> (22.5%) have the same starting site and CIGAR string. Any insight as to why
> this might be would be greatly appreciated.
>
> Thank you very much,
>
> Katie Stryjewski
>
> --
> Katherine Faust Stryjewski
> Ph.D Candidate
> Biology Department
> Boston University
> 5 Cummington St.
> Boston, MA 02215
>
>
> B29860:1:1107:16265:46832 67 chr1 488639 44 151M = 488774 286
> CTCCAAACCTCAGCACGAAACCTCAAACCAAACTTTAAGATGAGAATTGGTTGAGAAAAAAGGATTCTGCCCTCGTGTCTCTGTGGAGAATTGGGTTGATGGAAATGGGCCAGCACCTTGGTACAGGATTAAAAAGCCCCAGGTTTTATTC
> B@CFDADFGHFHBHIIGEHIEGHIIGEHIIIGHIIIIIGIHIIIGEEGFHFGIHGGGDIHHEDEFFFFEECCCDB?BDBCCCDACD>BDACCCDDBDB>@C>9CCDDCBBBDDDCDDDDD?4>CACCDDDDCDCDDDDBBDDD4ACBDDDA
> MD:Z:151 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:302
> YS:i:229 YT:Z:CP
>
> B29860:1:2209:1910:23657 1040 chr1 488670 44 242M * 0 0
> ACTTTAAGATGAGAATTGGTTGATAAAAAAGGATTCTGCCCTCGTGTCTCTGTGGAGAATTGGGTTGATGGAAATGGGCCAGCACCTTGGTACAGGATTAAAAAGCCCCAGGTTTTATTCTTCCTGCAGCTGAGAGAACTCCAAAATCAGGCTGGGAAATCACACAAGATCAGGCTGGGAAAGAAACAAAAATTATATTCTAGCAGAACAAACCTGTTGTTGGCTTTTGACGGGTATTGAAG
> @@@FFFFFBHF<DBEGGGGICH@CFGH<FHGGHGFFDGDDH@FHDG@FGIGF8BFDE@CHEECH
> ??=C>?DFFCCEE>ABBABDB??CCC5IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<I>CAEDEEEEC@CAFFFFFFHFHHDIIIGIIIHHHIIGGHGGFEIHGGGGGIHIIFHGGHGHIIIIIGGHHEACF
> <FIIHHHHHFFFDD@@B MD:Z:23G139T3A14C20A10A6T2A1C0A2T11 PG:Z:MarkDuplicates
> XG:i:0 NM:i:11 XM:i:11 XN:i:0 XO:i:0 AS:i:404 YT:Z:UU
>
> B29860:1:1115:19766:59219 67 chr1 488736 41 148M3S = 488921 335
> GATGGAAATGGGCCAGCACCTTGGTACAGGATTAAAAAGCCCCAGGTTTTATTCCTCCTGCAGCTGAGAGAACTCCAAAATCAGGCTGGGAAATCACACAAGATCAGGCTGGGAAAGAAACAAAAATTATATTCTAGCAGAACAAACCTGT
> C@
> @FFFFFHHHHHGIIIGIIIIIIHEEHIIHIIIIICHEHHIIGIIBHIIEHIIIIIIIIEIIIGIGHHHHFFFFFEEEDEEDDDDDDDBDDDCDCDDDDB??DDDDDCDDDD?CB?CDDDDDBDDDDDEDDDEDDDC@CDDDDDDDBDDC
> MD:Z:54T42T3A14C20A10 PG:Z:MarkDuplicates XG:i:0 NM:i:5 XM:i:5 XN:i:0
> XO:i:0 AS:i:260 YS:i:158 YT:Z:CP
>
> B29860:1:1107:16265:46832 131 chr1 488774 44 151M = 488639 -286
> GCCCCAGGTTTTATTCTTCCTGCAGCTGAGAGAACTCCAAAATCAGGCTGGGAAATCACACAAGATCAGGCTGGGAAAGAAACAAAAATTATATTCTAGCAGAACAAACCTGTTGTTGGCTTTTGACGGGTATTGAAGGGCTGGAGGAAGG
> <DDBACBDCDCDDDDDDBBCCDDDDCCDDDDDDDDC@DDDCCCDCDBCACDDDDCABDDCCCECCCBDDCEECECDD
> @EC@2HHIIIHHIIHGIGIIIIGHIIHIIIHIIIIIIIIIIIGHCGEGEIIHEGHH@HIIGDFBHHFFFFF@?@
> MD:Z:59T3A14C20A10A6T2A1C0A2T24 PG:Z:MarkDuplicates XG:i:0 NM:i:10 XM:i:10
> XN:i:0 XO:i:0 AS:i:229 YS:i:302 YT:Z:CP
>
> B29860:1:1115:19766:59219 131 chr1 488921 41 5M1I145M = 488736 -335
> AAGGACCTCAAAACAAAAAATAAACACCAAAATATTTCATAGAATCATAGATTCATTGAGGTTGGAAAAGCCCTCCCAGCCCAGGGAGTCCCAGCTGGGCCCCATGCCCCCCTTGTCCCTAGCCCAGAGCACTCAGTGCCACCTCCAGGGG
> ADCCCC<B?DCCB>BDEECCACBB?<3DDDEECCA@CEEEDEEEDC@CA:C@DCCDCCCDDDEDCCCDDDB
> >BBBDDCDDBC?8CDDDBDDDDDDDDDBBBBDC@
> <8A?)BHCIGEC:39GHHDIIIIHDCF<GHHDDAFCHFFFFFFC@C
> MD:Z:19C2C6G1C12A0G6A7C8A0G0T1T1A2A2G0T12C0T11A9C31 PG:Z:MarkDuplicates
> XG:i:1 NM:i:21 XM:i:20 XN:i:0 XO:i:1 AS:i:158 XS:i:161 YS:i:260 YT:Z:CP
>
> ᐧ
>
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming. The Go Parallel Website,
> sponsored by Intel and developed in partnership with Slashdot Media, is
> your
> hub for all things parallel software development, from weekly thought
> leadership blogs to news, videos, case studies, tutorials and more. Take a
> look and join the conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
>
>
------------------------------------------------------------------------------
Dive into the World of Parallel Programming. The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to