Dear Nils,

I counted BY HAND the number of duplicates that have the same tile in the 
A.debug.L1.sam file I had already sent you (note that there’s only a single 
lane).  The number is 12 (which matches my script).  However, Picard 
MarkDuplicates is reporting 25 READ_PAIR_OPTICAL_DUPLICATES, that is, 50 reads.

I really don’t want to be a pest, but we find that the optical duplicates 
functionality is AWESOME, and we’d be extremely happy for it to work.

Thank you again for your help.
Anna


From: Nils Homer [mailto:nho...@broadinstitute.org]
Sent: Thursday, October 16, 2014 8:41 PM
To: Salzberg, Anna
Cc: samtools-help@lists.sourceforge.net
Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of Picard 
MarkDuplicates

Thanks, Anna, for the example set.  I have observed a few things regarding this 
issue.

The first is that we do not extract the lane # from the read name, only tile, 
x-coordinate, and y-coordinate.  You can see this in the code here if you are 
interested: 
https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104
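
As a rough illustration of the parsing described above, here is a minimal Python sketch that pulls only tile, x, and y out of a read name. The regular expression assumes a five-field, colon-delimited name of the form instrument:lane:tile:x:y; that layout, along with the names `READ_NAME_PATTERN`, `PhysicalLocation`, and `parse_location`, is an assumption made for this example and is not copied from the Picard source.

```python
import re
from typing import NamedTuple, Optional

# Assumed read-name layout: <instrument>:<lane>:<tile>:<x>:<y>[anything else].
# The lane digit is matched but deliberately NOT captured.
READ_NAME_PATTERN = re.compile(
    r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"
)


class PhysicalLocation(NamedTuple):
    tile: int
    x: int
    y: int


def parse_location(read_name: str) -> Optional[PhysicalLocation]:
    """Return (tile, x, y) from a read name, or None if it does not match."""
    match = READ_NAME_PATTERN.fullmatch(read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return PhysicalLocation(tile, x, y)
```

Under these assumptions, two reads that differ only in lane parse to the same location, which is why the lane has to come from somewhere other than the read name.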

Secondly, we also do not retrieve the barcode or library identifier from the 
read name, since neither is embedded in it.  Both barcode and library 
identifier are also important to condition upon when searching for optical 
duplicates, or duplicates in general.

This brings us to the question: where *do* we expect to retrieve this 
information?  We use the read group header lines to capture lane, barcode, 
library, flowcell (for Illumina) and other information for specific sets or 
groups of reads.  If this information is given, which I recommend as a best 
practice, MarkDuplicates will behave as you expect.  I believe it is much more 
robust to annotate these metadata in the header than to rely wholly on parsing 
read names, since read name structures do change, albeit infrequently.
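
To make the read group header line concrete, here is a small Python sketch that formats one `@RG` line using the standard SAM tags ID, SM, LB, PU, and PL. The function name `format_read_group` and every value in the usage note are hypothetical placeholders, not taken from the A.debug.L1.sam file.

```python
def format_read_group(rg_id: str, sample: str, library: str,
                      platform_unit: str, platform: str = "ILLUMINA") -> str:
    """Build a tab-separated @RG header line from read group metadata.

    platform_unit (PU) is where flowcell and lane are commonly recorded,
    for example "<flowcell>.<lane>"; that convention is an assumption here.
    """
    fields = [
        ("ID", rg_id),          # read group identifier, unique in the header
        ("SM", sample),         # sample name
        ("LB", library),        # library identifier
        ("PU", platform_unit),  # platform unit, e.g. flowcell and lane
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)
```

For example, `format_read_group("A.L1", "A", "libA", "FLOWCELL1.1")` yields a line that declares the lane and library once in the header. Each read then refers back to it through an `RG:Z:A.L1` tag.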

I would recommend adding read groups to your SAM header within your pipeline.  
We use FastqToSam or IlluminaBasecallsToSam to set the read group appropriately 
depending on our inputs.  In Picard, we also have tools like 
AddOrReplaceReadGroups that can help you add read groups prior to marking 
duplicates.
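
As a sketch of how AddOrReplaceReadGroups might be wired into a pipeline, the Python helper below only assembles the command line; it does not run anything. It uses the tool's RGID, RGLB, RGPL, RGPU, and RGSM options in the `KEY=value` style. The jar location, the file names, the helper name `add_read_groups_command`, and all read group values are assumptions for illustration, and the exact invocation may differ between Picard versions.

```python
from typing import List


def add_read_groups_command(input_path: str, output_path: str, rg_id: str,
                            library: str, platform_unit: str, sample: str,
                            platform: str = "illumina",
                            picard_jar: str = "picard.jar") -> List[str]:
    """Assemble (but do not execute) an AddOrReplaceReadGroups command."""
    return [
        "java", "-jar", picard_jar,   # jar path is a placeholder
        "AddOrReplaceReadGroups",
        f"I={input_path}",
        f"O={output_path}",
        f"RGID={rg_id}",              # read group ID
        f"RGLB={library}",            # library
        f"RGPL={platform}",           # platform
        f"RGPU={platform_unit}",      # platform unit (flowcell/lane)
        f"RGSM={sample}",             # sample name
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` before the MarkDuplicates step, so that duplicates are marked on a file whose header already carries lane and library.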

Nils
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
