I have written an optical duplicate remover, but I would like to know your exact rules for identifying two reads as duplicates. From a brief skim of OpticalDuplicateFinder.java, it relies on a parameter “this.opticalDuplicatePixelDistance”, which is expected. Are you using the same threshold?
> we do not extract the lane # from the read name, only tile, x-coordinate, and
> y-coordinate.

Nils, why not use lane number?

Heng

On Oct 20, 2014, at 13:49, Salzberg, Anna <asalzb...@hmc.psu.edu> wrote:

> Dear Nils,
>
> I counted BY HAND the number of duplicates that have the same tile in the
> A.debug.L1.sam file I had already sent you (note that there’s only a single
> lane). The number is 12 (which matches my script). However, picard
> MarkDuplicates is reporting 25 READ_PAIR_OPTICAL_DUPLICATES, that is, 50 reads.
>
> I really don’t want to be a pest, however we find that the optical duplicates
> functionality is AWESOME, and we’d be extremely happy for it to work.
>
> Thank you again for your help.
> Anna
>
>
> From: Nils Homer [mailto:nho...@broadinstitute.org]
> Sent: Thursday, October 16, 2014 8:41 PM
> To: Salzberg, Anna
> Cc: samtools-help@lists.sourceforge.net
> Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of Picard
> MarkDuplicates
>
> Thanks Anna for the example set. I have observed a few things regarding this
> issue.
>
> The first is that we do not extract the lane # from the read name, only tile,
> x-coordinate, and y-coordinate. You can see this in the code here if you are
> interested:
> https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104
>
> Secondly, we also do not retrieve either the barcode information or the
> library identifier from the read name, since they are not embedded in it.
> Both barcode and library identifier are also important to condition upon
> when searching for optical duplicates, or duplicates in general.
>
> This brings us to the question: where *do* we expect to retrieve this
> information? We use the read group header lines to capture lane, barcode,
> library, flowcell (for Illumina) and other information for specific sets or
> groups of reads. If this information is given, which I recommend as a best
> practice, MarkDuplicates will behave as you expect. I believe it is much more
> robust to annotate these metadata in the header rather than rely wholly on
> parsing read names, since read name structures do change, albeit
> infrequently.
>
> I would recommend adding read groups to your SAM header within your pipeline.
> We use FastqToSam or IlluminaBasecallsToSam to set the read group
> appropriately depending on our inputs. In Picard, we also have tools like
> AddOrReplaceReadGroups that can help you add read groups prior to marking
> duplicates.
>
> Nils
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
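
For readers following the thread, here is a rough Python sketch of why the read group matters when only tile, x and y are taken from the read name. It is not Picard's implementation. The assumptions are: the last three colon-separated fields of the read name are tile, x and y; two reads count as optically close when they differ by at most a pixel threshold on each axis; the value 100 is only a placeholder threshold; and the names parse_location and count_optical_duplicates, the read names, and the read group labels are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Location(NamedTuple):
    tile: int
    x: int
    y: int


def parse_location(read_name: str) -> Optional[Location]:
    """Assumption: the last three colon-separated fields are tile, x, y.
    The lane field, if present, is deliberately ignored."""
    fields = read_name.split(":")
    if len(fields) < 3:
        return None
    try:
        tile, x, y = (int(f) for f in fields[-3:])
    except ValueError:
        return None
    return Location(tile, x, y)


def count_optical_duplicates(reads, max_pixel_distance=100):
    """reads: iterable of (read_group, read_name) pairs that were already
    found to be duplicates of each other by alignment position.
    Returns how many of them lie close to an earlier read on the same
    tile within the same read group (a simple greedy rule)."""
    by_group_and_tile = defaultdict(list)
    for read_group, name in reads:
        loc = parse_location(name)
        if loc is not None:
            by_group_and_tile[(read_group, loc.tile)].append(loc)

    flagged = 0
    for locs in by_group_and_tile.values():
        kept = []
        for loc in locs:
            near = any(
                abs(loc.x - k.x) <= max_pixel_distance
                and abs(loc.y - k.y) <= max_pixel_distance
                for k in kept
            )
            if near:
                flagged += 1
            else:
                kept.append(loc)
    return flagged


# Hypothetical read names: the second read is on lane 2 but shares the
# tile number and nearby x/y with the reads on lane 1.
names = [
    "M1:42:FC1:1:1101:1000:2000",
    "M1:42:FC1:2:1101:1010:2005",
    "M1:42:FC1:1:1101:1050:2040",
]
reads_without_rg = [(None, n) for n in names]
reads_with_rg = list(zip(["lane1", "lane2", "lane1"], names))

print(count_optical_duplicates(reads_without_rg))
print(count_optical_duplicates(reads_with_rg))
```

Without read groups, the lane 2 read is compared against lane 1 reads because only the tile number separates them, so the count goes up. With one read group per lane, reads from different lanes are never compared, which is the behaviour Nils describes when the header carries the lane information.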