Dear Nils,
I counted BY HAND the number of duplicates that have the same tile in the
A.debug.L1.sam file I had already sent you (note that there’s only a single
lane). The number is 12 (which matches my script). However, Picard
MarkDuplicates is reporting 25 READ_PAIR_OPTICAL_DUPLICATES, that is, 50 reads.
I really don’t want to be a pest, but we find that the optical duplicates
functionality is AWESOME, and we’d be extremely happy for it to work.
Thank you again for your help.
Anna
From: Nils Homer [mailto:nho...@broadinstitute.org]
Sent: Thursday, October 16, 2014 8:41 PM
To: Salzberg, Anna
Cc: samtools-help@lists.sourceforge.net
Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of Picard
MarkDuplicates
Thanks, Anna, for the example set. I have observed a few things regarding this
issue.
The first is that we do not extract the lane # from the read name, only tile,
x-coordinate, and y-coordinate. You can see this in the code here if you are
interested:
https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104
Secondly, we do not retrieve the barcode or the library identifier from the
read name either, since neither is embedded in it. Both barcode and library
identifier are also important to condition upon when searching for optical
duplicates, or duplicates in general.
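To make the two points above concrete, here is a minimal Python sketch of read-name-only parsing. It is not Picard's code (see the link above for that). It assumes read names have 5 or 7 colon-separated fields ending in tile, x, and y, and it uses a 100-pixel distance threshold as an assumed value. The names parse_location and looks_optical and the read names themselves are hypothetical.

```python
from typing import NamedTuple, Optional


class PhysicalLocation(NamedTuple):
    tile: int
    x: int
    y: int


def parse_location(read_name: str) -> Optional[PhysicalLocation]:
    """Take the last three colon-separated fields as tile, x, y.

    Assumption: names look like MACHINE:LANE:TILE:X:Y (5 fields) or
    MACHINE:RUN:FLOWCELL:LANE:TILE:X:Y (7 fields). Lane is deliberately
    not extracted, mirroring the behaviour described in the text.
    """
    fields = read_name.split(":")
    if len(fields) not in (5, 7):
        return None
    try:
        tile, x, y = (int(f) for f in fields[-3:])
    except ValueError:
        return None
    return PhysicalLocation(tile, x, y)


def looks_optical(a: str, b: str, max_pixel_distance: int = 100) -> bool:
    """True if both names parse, share a tile, and are close in both x and y."""
    loc_a, loc_b = parse_location(a), parse_location(b)
    if loc_a is None or loc_b is None:
        return False
    return (
        loc_a.tile == loc_b.tile
        and abs(loc_a.x - loc_b.x) <= max_pixel_distance
        and abs(loc_a.y - loc_b.y) <= max_pixel_distance
    )


# Hypothetical read names: same tile and nearby coordinates, but lanes 1 and 2.
print(looks_optical("M1:1:1101:5000:6000", "M1:2:1101:5010:6020"))
```

With only tile, x, and y taken from the name, the two hypothetical reads are treated as neighbours even though their lane fields differ. Lane, barcode, and library have to come from somewhere else.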
This brings us to the question: where *do* we expect to retrieve this
information? We use the read group header lines to capture lane, barcode,
library, flowcell (for Illumina), and other information for specific sets or
groups of reads. If this information is given, which I recommend as a best
practice, MarkDuplicates will behave as you expect. I believe it is much more
robust to annotate these metadata in the header than to rely wholly on parsing
read names, since read name structures do change, albeit infrequently.
I would recommend adding read groups to your SAM header within your pipeline.
We use FastqToSam or IlluminaBasecallsToSam to set the read group appropriately
depending on our inputs. In Picard, we also have tools like
AddOrReplaceReadGroups that can help you add read groups prior to marking
duplicates.
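
For illustration only, the following Python sketch shows what a read group looks like in a SAM file: an @RG header line plus an RG:Z tag on each record. The tag names (ID, SM, LB, PU, PL) come from the SAM specification. The helper names and all values (sample "A", library "libA", platform unit "FLOWCELL1.1") are hypothetical. In a real pipeline, a tool such as AddOrReplaceReadGroups would be the safer way to do this.

```python
def make_read_group_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build a tab-separated @RG header line from the given values."""
    fields = [
        ("ID", rg_id),          # read group identifier, referenced by RG:Z tags
        ("SM", sample),         # sample name
        ("LB", library),        # library identifier
        ("PU", platform_unit),  # platform unit, e.g. flowcell and lane
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in fields)


def tag_record(sam_line, rg_id):
    """Append an RG:Z tag to a SAM alignment line that does not have one yet."""
    line = sam_line.rstrip("\n")
    if "\tRG:Z:" in line:
        return line
    return f"{line}\tRG:Z:{rg_id}"


# Hypothetical values for a single-lane sample.
print(make_read_group_line("A.L1", "A", "libA", "FLOWCELL1.1"))
print(tag_record("read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII", "A.L1"))
```

The RG:Z value on each record must match the ID field of an @RG line in the header. That link is how lane and library information reaches the reads without any parsing of read names.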
Nils