Thanks Anna for the example set. I have observed a few things regarding
this issue
The first is that we do not extract the lane # from the read name, only
tile, x-coordinate, and y-coordinate. You can see this in the code here if
you are interested:
https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104
Secondly, we also do not retrieve either the barcode information or library
identifier in the read name, since they themselves are not embedded in the
read name. Both barcode and library identifier are also important to
condition upon when searching for optical duplicates, or duplicates in
general.
This brings us to where *do* we expect to retrieve this information? We
use the read group header lines to capture lane, barcode, library, flowcell
(for Illumina) and other information for specific sets or groups of reads.
If this information is given, which I recommend that as a best practice it
should, MarkDuplicates will behave as you expect. I believe it is much
more robust to annotate these metadata in the header rather than rely on
parsing read names wholly, since read name structures do change, albeit
infrequently.
I would recommend adding read groups to your SAM header within your
pipeline. We use FastqToSam or IlluminaBasecallsToSam to set the read
group appropriately depending on our inputs. In Picard, we also have tools
like AddOrReplaceReadGroups that can help you add read groups prior to
marking duplicates.
Nils
On Thu, Oct 16, 2014 at 3:52 PM, Salzberg, Anna <asalzb...@hmc.psu.edu>
wrote:
> Hi,
>
>
>
> Attached is a very small illustration set.
>
>
>
> Anna
>
>
>
> *From:* Nils Homer [mailto:nho...@broadinstitute.org]
> *Sent:* Thursday, October 16, 2014 2:30 PM
>
> *To:* Salzberg, Anna
> *Cc:* samtools-help@lists.sourceforge.net
> *Subject:* Re: [Samtools-help] Reporting Bug - Optical Duplicates of
> Picard MarkDuplicates
>
>
>
> Hey Anna,
>
>
>
> I am sorry but I do not have such a location. My suggestion would be to
> reduce the size of the input BAM until you only have a few such reads, then
> you can send over email.
>
>
>
> N
>
>
>
> On Thu, Oct 16, 2014 at 1:49 PM, Salzberg, Anna <asalzb...@hmc.psu.edu>
> wrote:
>
> Could you please provide a place where I can copy them to? Thanks!
>
>
>
> *From:* Nils Homer [mailto:nho...@broadinstitute.org]
> *Sent:* Thursday, October 16, 2014 1:20 PM
>
>
> *To:* Salzberg, Anna
> *Cc:* samtools-help@lists.sourceforge.net
> *Subject:* Re: [Samtools-help] Reporting Bug - Optical Duplicates of
> Picard MarkDuplicates
>
>
>
> Hey Anna,
>
>
>
> could you provide me with the inputs (i.e BAM and FASTA) to the
> MarkDuplicates tool so we can reproduce and debug? Having a reduced test
> case would be ideal as I cannot debug easily from the information you
> provided. Thanks!
>
>
>
> Nils
>
>
>
> On Thu, Oct 16, 2014 at 1:10 PM, Salzberg, Anna <asalzb...@hmc.psu.edu>
> wrote:
>
> Nils,
>
>
>
> I have identified a bug with optical duplicates that is not fixed in
> picard v 1.122.
>
>
>
> In particular, I ran bwa and then picard MarkDuplicates on lane 1 and lane
> 2 files separately, and I then did the same for the merged fastq files. I
> got the picard MarkDuplicates results below. Note that the overall %
> duplication is higher in the merged file than in the individual lane ones
> (to be expected), but that the OPTICAL duplicates is also higher, that is,
> it is more than the sum of the OPTICAL duplicates of lanes 1 and 2 (NOT to
> be expected).
>
>
>
> In other words, it is expected that the overall duplication is higher than
> the sum of the ones in lanes 1 and 2 due to interaction between the reads
> of lanes 1 and 2 (for ex., read1 from lane1 could have no duplicates in
> lane1, and read2 from lane2 could have no duplicate in lane2 but in the
> merged file they could be duplicates of each other). However, by
> definition, there should be no interaction in the OPTICAL duplicates
> between lanes 1 and 2, and therefore the results below seem wrong. My
> suspicion is that the MarkDuplicates optical duplication algorithm is not
> checking for the lane.
>
>
>
> Sample LIBRARY UNPAIRED_READS_EXAMINED
> READ_PAIRS_EXAMINED UNMAPPED_READS
> UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES
> READ_PAIR_OPTICAL_DUPLICATES
> PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
>
> A_merged Unknown Library 497938 21072701
> 939669 326848 9821728 3875890 0.46831
> 18726206
>
> A_L1 Unknown Library 247592
> 10518632 458565 141958 3109570
> 1074927 0.298856 18640119
>
> A_L2 Unknown Library 250347
> 10554068 481105 147968 3130326
> 1080576 0.30005 18605087
>
>
>
>
>
> Thank you for looking into this!
> Anna
>
>
>
> *From:* Salzberg, Anna
> *Sent:* Friday, October 10, 2014 11:33 AM
> *To:* 'Nils Homer'
> *Cc:* 'samtools-help@lists.sourceforge.net'
> *Subject:* RE: [Samtools-help] Reporting Bug - Optical Duplicates of
> Picard MarkDuplicates
>
>
>
> Nils,
>
>
>
> I see I made a mistake in my countTileDups script, in that I report the
> wrong variable in the end!! The number of dups in the same tile is in fact
> > than you reported optical duplicates.
>
>
>
> So please ignore my request for now.
>
>
>
> Thank you,
>
> Anna
>
>
>
> *From:* Salzberg, Anna
> *Sent:* Friday, October 10, 2014 10:26 AM
> *To:* 'Nils Homer'
> *Cc:* samtools-help@lists.sourceforge.net
> *Subject:* RE: [Samtools-help] Reporting Bug - Optical Duplicates of
> Picard MarkDuplicates
>
>
>
> Dear Nils,
>
>
>
> I installed Picard 1.122, and the number of optical duplicates was reduced
> by over 25% (the estimated library size was also different in the new
> version).
>
>
>
> Unfortunately, I still think that there is a bug with the number of
> optical duplicates, as simply counting the number of duplicates that have
> the same tile results in 3 orders of magnitude less than the MarkDuplicates
> optical duplicates count.
>
>
>
>
>
> I would *greatly* appreciate if you could look into this as this is super
> important to my lab. I have provided in my previous email 2 scripts; one
> of them is a very simple script (only a few lines) that simply counts
> duplicates with the same tile.
>
>
>
> Thank you very much for your help with this issue.
>
>
>
> Anna
>
>
>
> *From:* Nils Homer [mailto:nho...@broadinstitute.org
> <nho...@broadinstitute.org>]
> *Sent:* Thursday, October 09, 2014 4:34 PM
> *To:* Salzberg, Anna
> *Cc:* samtools-help@lists.sourceforge.net
> *Subject:* Re: [Samtools-help] Reporting Bug - Optical Duplicates of
> Picard MarkDuplicates
>
>
>
> I am replying to the list so others can benefit from our discussion.
>
>
>
> The latest Picard release to support updated Illumina read names is 1.120
> while your install is 1.99. You will need to update to this version or the
> latest version to get the benefit of this update.
>
>
>
> Nils
>
>
>
> On Thu, Oct 9, 2014 at 4:08 PM, Nils Homer <nho...@broadinstitute.org>
> wrote:
>
> Could you tell us what version of Picard you are using? There was an
> issue earlier with parsing read names from newer Illumina analysis software.
>
>
>
> Nils
>
>
>
> On Thu, Oct 9, 2014 at 3:00 PM, Salzberg, Anna <asalzb...@hmc.psu.edu>
> wrote:
>
> Hello,
>
>
>
> I am convinced that the optical duplicates count of the Picard
> MarkDuplicates command is incorrect. When I wrote a script to detect
> optical duplicates in my dataset, I got only ~1k optical duplicates as
> opposed to MarkDuplicates ~3 million. I think the problem with
> MarkDuplicates is tile related because I then wrote a super simple script
> that simply counts how many duplicates share the same tile, and that was <
> 4k, that is, 3 orders of magnitudes less than MarkDuplicates! The overall
> number of duplicates (opticals or otherwise) matched (~7 million). I'm
> convinced my script is right, as it's so simple.
>
>
>
> Remove optical duplicates script:
>
> https://gist.github.com/annasa/eef7c30152ac296bb49b
>
>
>
> Count duplicates in same tile:
>
> https://gist.github.com/annasa/f5633eecf012153a3ff2
>
> Both scripts take as input a sam file sorted on chr and startPos. They
> also assume that when the sequence name is parsed by ":" then the tile is
> the 5th field, x the 6th and y the 7th (e.g.
> HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933, where tile is '2209', x is
> '1705' and y is'6933'). Finally, they assume that the file is for a single
> lane, as I was working with such files.
>
> This is VERY important for my lab. Please advise as soon as you can.
>
> Thank you,
>
> Anna
>
>
>
>
>
>
> ------------------------------------------------------------------------------
> Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer
> Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports
> Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper
> Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer
>
> http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
>
>
>
>
>
>
>
>
>
------------------------------------------------------------------------------
Comprehensive Server Monitoring with Site24x7.
Monitor 10 servers for $9/Month.
Get alerted through email, SMS, voice calls or mobile push notifications.
Take corrective actions from your mobile device.
http://p.sf.net/sfu/Zoho
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help