Hello,
I am convinced that the optical duplicate count reported by Picard's
MarkDuplicates command is incorrect. When I wrote a script to detect optical
duplicates in my dataset, I got only ~1k optical duplicates, as opposed to
MarkDuplicates' ~3 million. I think the problem with MarkDuplicates is tile
related, because I then wrote a very simple script that just counts how many
duplicates share the same tile, and that count was < 4k, that is, about 3
orders of magnitude less than MarkDuplicates! The overall number of duplicates
(optical or otherwise) matched between my script and MarkDuplicates
(~7 million). I'm convinced my script is right, as it's so simple.
Remove optical duplicates script:
https://gist.github.com/annasa/eef7c30152ac296bb49b
Count duplicates in same tile:
https://gist.github.com/annasa/f5633eecf012153a3ff2
Both scripts take as input a SAM file sorted by chromosome and start position.
They also assume that when the read name is split on ":", the tile is the 5th
field, x the 6th and y the 7th (e.g. HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933,
where tile is '2209', x is '1705' and y is '6933'). Finally, they assume that
the file is for a single lane, as I was working with such files.
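
For concreteness, the read-name convention above can be sketched in Python
roughly as follows. This is an illustrative sketch only and not the code in
the gists: the names ReadPosition, parse_read_name and
count_same_tile_duplicates are hypothetical, and it assumes that two reads are
duplicates when they share the same reference name and start position
(strand and mate information are ignored), that the file covers a single lane,
and that read names carry no suffix after the y coordinate.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ReadPosition(NamedTuple):
    """Tile and pixel coordinates taken from a read name (hypothetical helper)."""
    tile: str
    x: int
    y: int


def parse_read_name(qname: str) -> ReadPosition:
    """Split a read name on ':' and take tile, x, y from fields 5, 6 and 7."""
    fields = qname.split(":")
    if len(fields) < 7:
        raise ValueError(f"read name has fewer than 7 ':'-separated fields: {qname!r}")
    # Assumption: the 7th field is a plain integer (no '/1' or '#index' suffix).
    return ReadPosition(tile=fields[4], x=int(fields[5]), y=int(fields[6]))


def count_same_tile_duplicates(sam_lines: Iterable[str]) -> int:
    """Count reads that share reference name, start position and tile
    with at least one other read, not counting the first read of each group.

    Assumption: 'duplicate' here means same RNAME and POS only.
    """
    groups = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        cols = line.rstrip("\n").split("\t")
        qname, rname, pos = cols[0], cols[2], cols[3]
        groups[(rname, pos, parse_read_name(qname).tile)] += 1
    # In a group of n reads, n - 1 are counted as duplicates of the first.
    return sum(n - 1 for n in groups.values() if n > 1)
```

Called on the lines of a single-lane SAM file, count_same_tile_duplicates
gives an upper bound on the optical duplicates under these assumptions, since
optical duplicates must at least lie on the same tile.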
This is VERY important for my lab. Please advise as soon as you can.
Thank you,
Anna
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help