We think this is the same bug as https://github.com/samtools/htslib/issues/578
The fix for that was never put in place because we didn't manage to get a response back, but changing the probability of match vs N to 0.25 instead of 1 does indeed cure this problem. The cause of the wrong indel being considered as good is because the probaln_glocal function returns a negative score (it's meant to be phred-scaled, so that in theory is impossible). This is due to a run of Ns in the reference, but I can't explain quite how that equates to negative logs! (The alternative fix in bcftools is simply to turn negative scores into very high positive ones indicating a bad alignment.) Eg as a patch: *** bcftools-1.8/htslib-1.8/probaln.c Mon Jul 31 09:15:20 2017 --- ./probaln.c Thu May 3 12:22:04 2018 *************** *** 134,140 **** int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1; for (k = beg, sum = 0.; k <= end; ++k) { int u; ! double e = (ref[k - 1] > 3 || query[0] > 3)? 1. : ref[k - 1] == query[0]? 1. - qual[0] : qual[0] * EM; set_u(u, bw, 1, k); fi[u+0] = e * bM; fi[u+1] = EI * bI; sum += fi[u] + fi[u+1]; --- 134,140 ---- int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1; for (k = beg, sum = 0.; k <= end; ++k) { int u; ! double e = (ref[k - 1] > 3 || query[0] > 3)? .25 : ref[k - 1] == query[0]? 1. - qual[0] : qual[0] * EM; set_u(u, bw, 1, k); fi[u+0] = e * bM; fi[u+1] = EI * bI; sum += fi[u] + fi[u+1]; *************** *** 151,157 **** double E[] = { qli * EM, // 00 1. - qli, // 01 ! 1., // 10 1., // 11 }; double M = 1./s[i-1]; --- 151,157 ---- double E[] = { qli * EM, // 00 1. - qli, // 01 ! .25, // 10 1., // 11 }; double M = 1./s[i-1]; There are still other issues with the code though, namely that it counts the total number of alignments having an indel rather than the number matching the reported indels. This makes filtering challenging. So if we find 6 possible indels and emit 2 of them in the vcf file then IDV and IMF should be based on the number of records that confirm at least 1 of those 2, and not the other 4. I'm trying to figure this out at the moment. I'm also doing a bcftools mpileup/call of an entire human chromosome 20 to see what differences the N tweak has. James -- James Bonfield (j...@sanger.ac.uk) The Sanger Institute, Hinxton, Cambs, CB10 1SA -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help