Hello,

We are having some issues with Picard's MarkDuplicates function, and we
were hoping someone could shed some light on how exactly this algorithm
works and why it might be giving us these results.

To give a quick overview of the data we're working with: we have samples
from several closely-related species of finches. Our libraries were
generated using Illumina's Nextera kit. We removed adapters using CutAdapt,
merged overlapping reads using PEAR, aligned the sequences to a reference
genome (Zebra Finch) using bowtie2, converted the resulting .sam files to
.bam files using Samtools, and finally sorted the reads in coordinate order
using Picard. Then we ran MarkDuplicates.

Looking at the results, we noticed that a much higher proportion of our
unpaired reads (~15-60%; these are the reads that were merged using PEAR)
are being marked as duplicates than our paired reads (1-3%). When we
examine the reads marked as duplicates, we see two main problems. The first
that reads that were soft clipped to the same starting site during local
alignment are being marked as duplicates. Our species are at least several
million years divergent from the reference genome, and bowtie is clipping
the ends of fragments where the alignments don't match well. However, we
were under the impression that Picard was able to take this into account.

The second issue is that we also see reads being marked as duplicates with
no explanation that we can determine. One example is below. The unpaired
read marked as a duplicate sits between the first reads of two different
paired sequences. None of these reads were trimmed during alignment. The
lines from the .bam file for this duplicate and the two sets of paired
reads are below.

Out of 560,512 reads marked as duplicates from one sample, only 126,256
(22.5%) have the same starting site and CIGAR string. Any insight as to why
this might be would be greatly appreciated.

Thank you very much,

Katie Stryjewski

-- 
Katherine Faust Stryjewski
Ph.D Candidate
Biology Department
Boston University
5 Cummington St.
Boston, MA 02215


B29860:1:1107:16265:46832 67 chr1 488639 44 151M = 488774 286
CTCCAAACCTCAGCACGAAACCTCAAACCAAACTTTAAGATGAGAATTGGTTGAGAAAAAAGGATTCTGCCCTCGTGTCTCTGTGGAGAATTGGGTTGATGGAAATGGGCCAGCACCTTGGTACAGGATTAAAAAGCCCCAGGTTTTATTC
B@CFDADFGHFHBHIIGEHIEGHIIGEHIIIGHIIIIIGIHIIIGEEGFHFGIHGGGDIHHEDEFFFFEECCCDB?BDBCCCDACD>BDACCCDDBDB>@C>9CCDDCBBBDDDCDDDDD?4>CACCDDDDCDCDDDDBBDDD4ACBDDDA
MD:Z:151 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:302
YS:i:229 YT:Z:CP

B29860:1:2209:1910:23657 1040 chr1 488670 44 242M * 0 0
ACTTTAAGATGAGAATTGGTTGATAAAAAAGGATTCTGCCCTCGTGTCTCTGTGGAGAATTGGGTTGATGGAAATGGGCCAGCACCTTGGTACAGGATTAAAAAGCCCCAGGTTTTATTCTTCCTGCAGCTGAGAGAACTCCAAAATCAGGCTGGGAAATCACACAAGATCAGGCTGGGAAAGAAACAAAAATTATATTCTAGCAGAACAAACCTGTTGTTGGCTTTTGACGGGTATTGAAG
@@@FFFFFBHF<DBEGGGGICH@CFGH<FHGGHGFFDGDDH@FHDG@FGIGF8BFDE@CHEECH
??=C>?DFFCCEE>ABBABDB??CCC5IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<I>CAEDEEEEC@CAFFFFFFHFHHDIIIGIIIHHHIIGGHGGFEIHGGGGGIHIIFHGGHGHIIIIIGGHHEACF
<FIIHHHHHFFFDD@@B MD:Z:23G139T3A14C20A10A6T2A1C0A2T11 PG:Z:MarkDuplicates
XG:i:0 NM:i:11 XM:i:11 XN:i:0 XO:i:0 AS:i:404 YT:Z:UU

B29860:1:1115:19766:59219 67 chr1 488736 41 148M3S = 488921 335
GATGGAAATGGGCCAGCACCTTGGTACAGGATTAAAAAGCCCCAGGTTTTATTCCTCCTGCAGCTGAGAGAACTCCAAAATCAGGCTGGGAAATCACACAAGATCAGGCTGGGAAAGAAACAAAAATTATATTCTAGCAGAACAAACCTGT
C@
@FFFFFHHHHHGIIIGIIIIIIHEEHIIHIIIIICHEHHIIGIIBHIIEHIIIIIIIIEIIIGIGHHHHFFFFFEEEDEEDDDDDDDBDDDCDCDDDDB??DDDDDCDDDD?CB?CDDDDDBDDDDDEDDDEDDDC@CDDDDDDDBDDC
MD:Z:54T42T3A14C20A10 PG:Z:MarkDuplicates XG:i:0 NM:i:5 XM:i:5 XN:i:0
XO:i:0 AS:i:260 YS:i:158 YT:Z:CP

B29860:1:1107:16265:46832 131 chr1 488774 44 151M = 488639 -286
GCCCCAGGTTTTATTCTTCCTGCAGCTGAGAGAACTCCAAAATCAGGCTGGGAAATCACACAAGATCAGGCTGGGAAAGAAACAAAAATTATATTCTAGCAGAACAAACCTGTTGTTGGCTTTTGACGGGTATTGAAGGGCTGGAGGAAGG
<DDBACBDCDCDDDDDDBBCCDDDDCCDDDDDDDDC@DDDCCCDCDBCACDDDDCABDDCCCECCCBDDCEECECDD
@EC@2HHIIIHHIIHGIGIIIIGHIIHIIIHIIIIIIIIIIIGHCGEGEIIHEGHH@HIIGDFBHHFFFFF@?@
MD:Z:59T3A14C20A10A6T2A1C0A2T24 PG:Z:MarkDuplicates XG:i:0 NM:i:10 XM:i:10
XN:i:0 XO:i:0 AS:i:229 YS:i:302 YT:Z:CP

B29860:1:1115:19766:59219 131 chr1 488921 41 5M1I145M = 488736 -335
AAGGACCTCAAAACAAAAAATAAACACCAAAATATTTCATAGAATCATAGATTCATTGAGGTTGGAAAAGCCCTCCCAGCCCAGGGAGTCCCAGCTGGGCCCCATGCCCCCCTTGTCCCTAGCCCAGAGCACTCAGTGCCACCTCCAGGGG
ADCCCC<B?DCCB>BDEECCACBB?<3DDDEECCA@CEEEDEEEDC@CA:C@DCCDCCCDDDEDCCCDDDB
>BBBDDCDDBC?8CDDDBDDDDDDDDDBBBBDC@
<8A?)BHCIGEC:39GHHDIIIIHDCF<GHHDDAFCHFFFFFFC@C
MD:Z:19C2C6G1C12A0G6A7C8A0G0T1T1A2A2G0T12C0T11A9C31 PG:Z:MarkDuplicates
XG:i:1 NM:i:21 XM:i:20 XN:i:0 XO:i:1 AS:i:158 XS:i:161 YS:i:260 YT:Z:CP

ᐧ
------------------------------------------------------------------------------
Dive into the World of Parallel Programming. The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to