Does anyone have any recommended parameters for filtering bcftools
indel calls?

bcftools produces a *lot* of false-positive indels, and filtering on
QUAL works poorly compared to GATK / FreeBayes results.  However, I
noticed that the IDV and IMF INFO fields give a strong way to separate
false positives from true positives.

For example, with the SynDip truth set (CHM1 + CHM13), straight bcftools
gives 439,484 true positives (TP) and 181,793 false positives (FP)
unfiltered.  Filtering by QUAL >= 30 changes this to TP 417,136, FP
163,720, so not much at all.
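
To put those counts in perspective, here is a small sketch that works
out what fraction of the true positives a filter keeps and what
fraction of the false positives it removes.  The function name is made
up for illustration; the counts are the ones quoted above.

```python
def filter_effect(tp_before, fp_before, tp_after, fp_after):
    """Return (fraction of TP kept, fraction of FP removed) for a filter."""
    tp_kept = tp_after / tp_before
    fp_removed = 1.0 - fp_after / fp_before
    return tp_kept, fp_removed

# Counts from the SynDip comparison: unfiltered vs QUAL >= 30.
tp_kept, fp_removed = filter_effect(439_484, 181_793, 417_136, 163_720)
print(f"QUAL >= 30 keeps {tp_kept:.1%} of TP, removes {fp_removed:.1%} of FP")
```

On these numbers QUAL >= 30 throws away roughly 5% of the true calls
while removing only about 10% of the false ones.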

Filtering instead on IDV >= 3 && IMF >= 0.3 gives TP 436,037, FP 20,708.
IDV >= 6 && IMF >= 0.1 gives TP 426,163, FP 11,400.  Given the total
number of true indels in the SynDip truth set, going from unfiltered
calls to this second filter takes us from 79.0% recall, 67.3% precision
to 76.6% recall, 98.0% precision.
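
As a rough sketch of what such a hard filter does, the Python below
applies IDV and IMF thresholds to tab-separated VCF records.  The
function names and the example records are invented for illustration
(they are not SynDip data), and passing through records that carry no
IDV/IMF tags is an assumption of this sketch, not something bcftools
prescribes.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'INDEL;IDV=7;IMF=0.35' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags such as INDEL have no value
    return info

def passes_indel_filter(vcf_line, min_idv=6, min_imf=0.1):
    """Keep untagged records; keep tagged ones only if both thresholds are met."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the 8th VCF column
    if "IDV" not in info or "IMF" not in info:
        return True  # assumption: records without these tags are left alone
    return int(info["IDV"]) >= min_idv and float(info["IMF"]) >= min_imf

# Made-up records for illustration only.
records = [
    "chr1\t100\t.\tAT\tA\t50\t.\tINDEL;IDV=8;IMF=0.40;DP=20",
    "chr1\t200\t.\tG\tGA\t45\t.\tINDEL;IDV=2;IMF=0.05;DP=40",
    "chr1\t300\t.\tC\tT\t60\t.\tDP=30",
]
kept = [r for r in records if passes_indel_filter(r)]
for r in kept:
    print(r)
```

In practice the same thresholds should be expressible directly as a
bcftools filtering expression via -i or -e, which avoids parsing the
VCF by hand.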

These are enormously better metrics than QUAL for discriminating
between correct and incorrect results, but they appear to be
completely undocumented other than in the header of the VCF file.
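
Since the header is the only place these tags are described, a minimal
sketch for pulling the descriptions out of it may help.  The header
lines below are an assumption about what bcftools mpileup typically
writes; check the header of your own VCF, as the wording may differ
between versions.

```python
import re

# Assumed example header lines; verify against your own bcftools output.
header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">',
    '##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">',
]

INFO_RE = re.compile(r'##INFO=<ID=([^,]+),.*Description="([^"]*)"')

def info_descriptions(header_lines):
    """Map each INFO tag ID to its Description string."""
    out = {}
    for line in header_lines:
        m = INFO_RE.match(line)
        if m:
            out[m.group(1)] = m.group(2)
    return out

desc = info_descriptions(header)
for tag in ("IDV", "IMF"):
    print(tag, "-", desc.get(tag, "not present in header"))
```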

I've tried a few other parameters without such good results.  In
theory they could all be combined together in some phred-style
classifier system, similar to VQSR with GATK.  Has anyone done this
already?  If not, do people have specific hard-filtering parameters
they use?

James

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
