We think this is the same bug as
https://github.com/samtools/htslib/issues/578

The fix for that was never put in place because we didn't manage to
get a response back, but changing the probability of match vs N to
0.25 instead of 1 does indeed cure this problem.

The cause of the wrong indel being considered as good is because the
probaln_glocal function returns a negative score (it's meant to be
phred-scaled, so that in theory is impossible).  This is due to a run
of Ns in the reference, but I can't explain quite how that equates to
negative logs!  (The alternative fix in bcftools is simply to turn
negative scores into very high positive ones indicating a bad
alignment.)

Eg as a patch:

*** bcftools-1.8/htslib-1.8/probaln.c   Mon Jul 31 09:15:20 2017
--- ./probaln.c                         Thu May  3 12:22:04 2018
***************
*** 134,140 ****
          int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
          for (k = beg, sum = 0.; k <= end; ++k) {
              int u;
!             double e = (ref[k - 1] > 3 || query[0] > 3)? 1. : ref[k - 1] == 
query[0]? 1. - qual[0] : qual[0] * EM;
              set_u(u, bw, 1, k);
              fi[u+0] = e * bM; fi[u+1] = EI * bI;
              sum += fi[u] + fi[u+1];
--- 134,140 ----
          int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
          for (k = beg, sum = 0.; k <= end; ++k) {
              int u;
!             double e = (ref[k - 1] > 3 || query[0] > 3)? .25 : ref[k - 1] == 
query[0]? 1. - qual[0] : qual[0] * EM;
              set_u(u, bw, 1, k);
              fi[u+0] = e * bM; fi[u+1] = EI * bI;
              sum += fi[u] + fi[u+1];
***************
*** 151,157 ****
          double E[] = {
              qli * EM, // 00
              1. - qli, // 01
!             1.,       // 10
              1.,       // 11
          };
          double M = 1./s[i-1];
--- 151,157 ----
          double E[] = {
              qli * EM, // 00
              1. - qli, // 01
!             .25,      // 10
              1.,       // 11
          };
          double M = 1./s[i-1];

There are still other issues with the code though, namely that it
counts the total number of alignments having an indel rather than the
number matching the reported indels.  This makes filtering
challenging.

So if we find 6 possible indels and emit 2 of them in the vcf file
then IDV and IMF should be based on the number of records that confirm
at least 1 of those 2, and not the other 4.  I'm trying to figure this
out at the moment.

I'm also doing a bcftools mpileup/call of an entire human chromosome
20 to see what differences the N tweak has.

James

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to