I have written an optical duplicate remover, but I would like to know your 
exact rules for identifying two reads as duplicates. From a brief skim of 
OpticalDuplicateFinder.java, it relies on the parameter 
“this.opticalDuplicatePixelDistance”, which is expected. Are you using the same 
threshold?
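
To make the question concrete, here is a minimal Python sketch of the kind of 
rule I have in mind. It is an assumption about the logic, not a copy of the 
Picard code: two reads count as optical duplicates when they sit on the same 
tile and their x and y coordinates each differ by no more than a pixel 
threshold. The names `Location` and `is_optical_duplicate` and the default of 
100 pixels are illustrative only.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Physical position of a read on the flowcell (illustrative type)."""
    tile: int
    x: int
    y: int


def is_optical_duplicate(a: Location, b: Location,
                         max_pixel_distance: int = 100) -> bool:
    """Return True if both reads are on the same tile and lie within
    max_pixel_distance of each other on the x axis and on the y axis.

    Assumption: the threshold is applied per axis, not as a Euclidean distance.
    """
    if a.tile != b.tile:
        return False
    return (abs(a.x - b.x) <= max_pixel_distance
            and abs(a.y - b.y) <= max_pixel_distance)
```

If the real rule uses a different distance measure or extra conditions, that 
is exactly what I would like to learn.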

> we do not extract the lane # from the read name, only tile, x-coordinate, and 
> y-coordinate.

Nils, why not use lane number?
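
For illustration, a small Python sketch of pulling the lane out along with the 
tile and coordinates. It assumes colon-delimited Illumina-style read names in 
which the last three fields are tile, x and y and the field just before them 
is the lane, with no trailing "/1" or comment on the name. The function name 
`parse_read_name` and the read name in the usage note are made up.

```python
from typing import NamedTuple, Optional


class ReadLocation(NamedTuple):
    lane: Optional[int]  # None when the name carries no numeric lane field
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ReadLocation:
    """Split a colon-delimited read name into lane, tile, x and y.

    Assumption: tile, x and y are the last three fields, and the lane,
    if present, is the field immediately before the tile.
    """
    fields = name.split(":")
    if len(fields) < 3:
        raise ValueError(f"cannot find tile/x/y in read name: {name!r}")
    tile, x, y = (int(field) for field in fields[-3:])
    lane = None
    if len(fields) >= 4 and fields[-4].isdigit():
        lane = int(fields[-4])
    return ReadLocation(lane, tile, x, y)
```

With a hypothetical name such as "HWI-ST1234:100:C1ABCACXX:1:1101:1234:5678", 
this gives lane 1, tile 1101, x 1234 and y 5678.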

Heng

On Oct 20, 2014, at 13:49, Salzberg, Anna <asalzb...@hmc.psu.edu> wrote:

> Dear Nils,
>  
> I counted BY HAND the number of duplicates that have the same tile in the 
> A.debug.L1.sam file I had already sent you (note that there’s only a single 
> lane).  The number is 12 (which matches my script).  However, Picard 
> MarkDuplicates is reporting 25 READ_PAIR_OPTICAL_DUPLICATES, that is, 50 reads.
>  
> I really don’t want to be a pest, however we find that the optical duplicates 
> functionality is AWESOME, and we’d be extremely happy for it to work.
>  
> Thank you again for your help.
> Anna
>  
>  
> From: Nils Homer [mailto:nho...@broadinstitute.org] 
> Sent: Thursday, October 16, 2014 8:41 PM
> To: Salzberg, Anna
> Cc: samtools-help@lists.sourceforge.net
> Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of Picard 
> MarkDuplicates
>  
> Thanks, Anna, for the example set.  I have observed a few things regarding 
> this issue.
>  
> The first is that we do not extract the lane # from the read name, only tile, 
> x-coordinate, and y-coordinate.  You can see this in the code here if you are 
> interested: 
> https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104
>  
> Secondly, we also do not retrieve the barcode or library identifier from the 
> read name, since neither is embedded in it.  Both barcode and library 
> identifier are also important to condition on when searching for optical 
> duplicates, or duplicates in general.  
>  
> This brings us to the question: where *do* we expect to retrieve this 
> information?  We use the read group header lines to capture lane, barcode, 
> library, flowcell (for Illumina) and other information for specific sets or 
> groups of reads.  If this information is given, which I recommend as a best 
> practice, MarkDuplicates will behave as you expect.  I believe it is much more 
> robust to annotate these metadata in the header than to rely wholly on parsing 
> read names, since read name structures do change, albeit infrequently.
>  
> I would recommend adding read groups to your SAM header within your pipeline. 
>  We use FastqToSam or IlluminaBasecallsToSam to set the read group 
> appropriately depending on our inputs.  In Picard, we also have tools like 
> AddOrReplaceReadGroups that can help you add read groups prior to marking 
> duplicates.
>  
> Nils
> ------------------------------------------------------------------------------
> Comprehensive Server Monitoring with Site24x7.
> Monitor 10 servers for $9/Month.
> Get alerted through email, SMS, voice calls or mobile push notifications.
> Take corrective actions from your mobile device.
> http://p.sf.net/sfu/Zoho
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
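
As a rough illustration of the read-group suggestion in the quoted reply, 
here is a Python sketch that builds an @RG header line of the kind those tools 
write. The tags ID, SM, LB, PU and PL come from the SAM specification; the 
function name `read_group_line` and every value in the usage note are 
hypothetical, and PU is assumed to hold flowcell and lane.

```python
def read_group_line(rg_id: str, sample: str, library: str,
                    platform_unit: str, platform: str = "ILLUMINA") -> str:
    """Build a tab-separated @RG line for a SAM header.

    Assumption: platform_unit encodes flowcell and lane (e.g. "FLOWCELL.1"),
    so lane information is carried by the read group rather than the read name.
    """
    fields = [
        ("ID", rg_id),          # unique read group identifier
        ("SM", sample),         # sample name
        ("LB", library),        # library identifier
        ("PU", platform_unit),  # platform unit (assumed flowcell.lane)
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)
```

For example, read_group_line("A.L1", "A", "libA", "FLOWCELL.1") produces one 
header line; each read in that group then needs a matching RG:Z:A.L1 tag.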

