Does anyone have any recommended parameters for filtering bcftools indel calls?
It makes a *lot* of indel false positives, and filtering on QUAL isn't great compared to GATK / FreeBayes results. However, I noticed that the IDV and IMF INFO fields give a strong way to separate false positives from true positives.

For example, with the SynDip truth set (CHM1 + CHM13), the true positive (TP) and false positive (FP) counts are:

  Unfiltered (straight bcftools): TP 439,484, FP 181,793
  QUAL >= 30:                     TP 417,136, FP 163,720 (so not much change at all)
  IDV >= 3 && IMF >= 0.3:         TP 436,037, FP 20,708
  IDV >= 6 && IMF >= 0.1:         TP 426,163, FP 11,400

Given the total number of true indels in the SynDip truth set, this means we went from 79.0% recall and 67.3% precision to 76.6% recall and 98.0% precision. IDV and IMF are enormously better than QUAL at discriminating between correct and incorrect calls, but they appear to be completely undocumented other than in the header of the VCF file.

I've tried a few other parameters without such good results, but in theory they could all be combined in some phred-style classifier system, similar to VQSR with GATK. Has anyone done this already? If not, do people have specific hard-filtering parameters they use?

James

--
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
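
For anyone wanting to try the IDV / IMF thresholds quoted in the post, here is a minimal Python sketch of that hard filter applied to uncompressed VCF text. It is only an illustration, not part of bcftools: the function names are made up, and it assumes that indel records carry the INDEL flag plus IDV and IMF in the INFO column (column 8), as in bcftools mpileup / call output. Non-indel records are passed through unchanged, and indel records lacking either tag are dropped.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'INDEL;IDV=7;IMF=0.35' into a dict.

    Flag entries without '=' (e.g. INDEL) are stored with the value True.
    """
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def passes_indel_filter(vcf_line, min_idv=6, min_imf=0.1):
    """Return True if a VCF data line should be kept.

    Assumption: indels are marked with the INDEL flag in INFO.
    Records without that flag are kept as they are; indel records
    need IDV >= min_idv and IMF >= min_imf to be kept.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])  # INFO is the 8th VCF column
    if "INDEL" not in info:
        return True
    try:
        idv = int(info["IDV"])
        imf = float(info["IMF"])
    except (KeyError, ValueError):
        # Indel record with missing or malformed IDV / IMF: drop it.
        return False
    return idv >= min_idv and imf >= min_imf


def filter_vcf(lines, min_idv=6, min_imf=0.1):
    """Yield header lines and every data line that passes the filter."""
    for line in lines:
        if line.startswith("#") or passes_indel_filter(line, min_idv, min_imf):
            yield line
```

It could be driven with something like `sys.stdout.writelines(filter_vcf(sys.stdin))` on a decompressed VCF stream; passing `min_idv=3, min_imf=0.3` gives the other threshold pair from the post.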