Could you please provide a place where I can copy them to?  Thanks!

From: Nils Homer [mailto:nho...@broadinstitute.org]
Sent: Thursday, October 16, 2014 1:20 PM
To: Salzberg, Anna
Cc: samtools-help@lists.sourceforge.net
Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of Picard 
MarkDuplicates

Hey Anna,

Could you provide me with the inputs (i.e., BAM and FASTA) to the MarkDuplicates
tool so we can reproduce and debug?  A reduced test case would be ideal, as I
cannot easily debug from the information you provided.  Thanks!

Nils

On Thu, Oct 16, 2014 at 1:10 PM, Salzberg, Anna
<asalzb...@hmc.psu.edu> wrote:
Nils,

I have identified a bug with optical duplicates that is not fixed in Picard
v1.122.

In particular, I ran bwa and then Picard MarkDuplicates on the lane 1 and lane 2
files separately, and then did the same for the merged fastq files.  I got the
Picard MarkDuplicates results below.  Note that the overall % duplication is
higher in the merged file than in the individual lane files (to be expected),
but the number of OPTICAL duplicates is also higher: it is more than the sum of
the OPTICAL duplicates of lanes 1 and 2 (NOT to be expected).

In other words, the overall duplication is expected to be higher than the sum
of the values for lanes 1 and 2 because of interaction between the reads of the
two lanes (for example, read1 from lane 1 could have no duplicates in lane 1,
and read2 from lane 2 could have no duplicates in lane 2, but in the merged file
they could be duplicates of each other).  However, by definition, there should
be no interaction in the OPTICAL duplicates between lanes 1 and 2, so the
results below seem wrong.  My suspicion is that the MarkDuplicates optical
duplicate algorithm is not checking the lane.

Sample                        A_merged         A_L1             A_L2
LIBRARY                       Unknown Library  Unknown Library  Unknown Library
UNPAIRED_READS_EXAMINED       497938           247592           250347
READ_PAIRS_EXAMINED           21072701         10518632         10554068
UNMAPPED_READS                939669           458565           481105
UNPAIRED_READ_DUPLICATES      326848           141958           147968
READ_PAIR_DUPLICATES          9821728          3109570          3130326
READ_PAIR_OPTICAL_DUPLICATES  3875890          1074927          1080576
PERCENT_DUPLICATION           0.46831          0.298856         0.30005
ESTIMATED_LIBRARY_SIZE        18726206         18640119         18605087
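
As a quick sanity check of the comparison described above, the following sketch
simply re-adds the READ_PAIR_OPTICAL_DUPLICATES values from the table.  The
dictionary and variable names are illustrative only; nothing here comes from
Picard itself.

```python
# READ_PAIR_OPTICAL_DUPLICATES values copied from the table above.
optical = {"A_merged": 3875890, "A_L1": 1074927, "A_L2": 1080576}

# If optical duplicates never span lanes, the merged count should not
# exceed the sum of the per-lane counts.
lane_sum = optical["A_L1"] + optical["A_L2"]
excess = optical["A_merged"] - lane_sum

print(f"sum of per-lane optical duplicates: {lane_sum}")
print(f"merged optical duplicates:          {optical['A_merged']}")
print(f"excess in merged file:              {excess}")
```

With these numbers the per-lane sum is 2155503, so the merged file reports
1720387 more optical duplicate pairs than the two lanes combined.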


Thank you for looking into this!
Anna

From: Salzberg, Anna
Sent: Friday, October 10, 2014 11:33 AM
To: 'Nils Homer'
Cc: 'samtools-help@lists.sourceforge.net'
Subject: RE: [Samtools-help] Reporting Bug - Optical Duplicates of Picard 
MarkDuplicates

Nils,

I see I made a mistake in my countTileDups script: I reported the wrong
variable at the end.  The number of dups in the same tile is in fact greater
than your reported optical duplicates.

So please ignore my request for now.

Thank you,
Anna

From: Salzberg, Anna
Sent: Friday, October 10, 2014 10:26 AM
To: 'Nils Homer'
Cc: samtools-help@lists.sourceforge.net
Subject: RE: [Samtools-help] Reporting Bug - Optical Duplicates of Picard 
MarkDuplicates

Dear Nils,

I installed Picard 1.122, and the number of optical duplicates was reduced by 
over 25% (the estimated library size was also different in the new version).

Unfortunately, I still think that there is a bug with the number of optical
duplicates, because simply counting the duplicates that share the same tile
gives a number 3 orders of magnitude lower than the MarkDuplicates optical
duplicate count.


I would *greatly* appreciate it if you could look into this, as it is very
important to my lab.  In my previous email I provided 2 scripts; one of them is
a very simple script (only a few lines) that just counts duplicates with the
same tile.

Thank you very much for your help with this issue.

Anna

From: Nils Homer [mailto:nho...@broadinstitute.org]
Sent: Thursday, October 09, 2014 4:34 PM
To: Salzberg, Anna
Cc: samtools-help@lists.sourceforge.net
Subject: Re: [Samtools-help] Reporting Bug - Optical Duplicates of Picard 
MarkDuplicates

I am replying to the list so others can benefit from our discussion.

The latest Picard release to support updated Illumina read names is 1.120,
while your install is 1.99.  You will need to update to that version or to the
latest version to get the benefit of this update.

Nils

On Thu, Oct 9, 2014 at 4:08 PM, Nils Homer
<nho...@broadinstitute.org> wrote:
Could you tell us what version of Picard you are using?  There was an issue 
earlier with parsing read names from newer Illumina analysis software.

Nils

On Thu, Oct 9, 2014 at 3:00 PM, Salzberg, Anna
<asalzb...@hmc.psu.edu> wrote:
Hello,

I am convinced that the optical duplicates count of the Picard MarkDuplicates
command is incorrect.  When I wrote a script to detect optical duplicates in my
dataset, I got only ~1k optical duplicates, as opposed to ~3 million from
MarkDuplicates.  I think the problem with MarkDuplicates is tile related,
because I then wrote a very simple script that just counts how many duplicates
share the same tile, and that was < 4k, that is, 3 orders of magnitude less
than MarkDuplicates!  The overall number of duplicates (optical or otherwise)
matched (~7 million).  I'm convinced my script is right, as it's so simple.

Remove optical duplicates script:
https://gist.github.com/annasa/eef7c30152ac296bb49b

Count duplicates in same tile:
https://gist.github.com/annasa/f5633eecf012153a3ff2

Both scripts take as input a SAM file sorted on chr and startPos.  They also
assume that when the sequence name is split on ":", the tile is the 5th field,
x the 6th and y the 7th (e.g. HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933, where
tile is '2209', x is '1705' and y is '6933').  Finally, they assume that the
file is for a single lane, as I was working with such files.
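
To make the read-name assumption concrete, here is a minimal Python sketch of
that parsing.  It is not the code from either gist.  The names ReadLocation,
parse_read_name and looks_optical are hypothetical, taking the lane from the
4th field follows the usual Illumina naming rather than anything stated above,
and the 100-pixel max_dist default is only an assumption of this sketch.

```python
from typing import NamedTuple


class ReadLocation(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ReadLocation:
    """Split a read name on ':' and pull out lane, tile, x and y."""
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 ':'-separated fields: {name!r}")
    # 1-based fields 4, 5, 6, 7 are lane, tile, x, y (lane is an assumption).
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return ReadLocation(lane, tile, x, y)


def looks_optical(a: ReadLocation, b: ReadLocation, max_dist: int = 100) -> bool:
    """True if two duplicate reads share lane and tile and sit close together.

    max_dist is a hypothetical pixel threshold chosen for this sketch.
    """
    return (
        a.lane == b.lane
        and a.tile == b.tile
        and abs(a.x - b.x) <= max_dist
        and abs(a.y - b.y) <= max_dist
    )


loc = parse_read_name("HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933")
print(loc)  # ReadLocation(lane=1, tile=2209, x=1705, y=6933)
```

Comparing the lane as well as the tile matters once files from several lanes
are merged, since the same tile number appears in every lane.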

This is VERY important for my lab.  Please advise as soon as you can.

Thank you,

Anna


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help