Hi James,

Thank you very much for looking in to this. I will try apply your suggested
patch and see if it improves the calling.
In the short term I can try do some prefiltering on the BAM file to remove
reads with large indels to see if we see the same effect with smaller
indels.

I am quite interested to know the results from the human chromosome.
Please let me know if there is anything I can do from my side.




On Thu, May 3, 2018 at 12:31 PM, James Bonfield <j...@sanger.ac.uk> wrote:

> We think this is the same bug as
> https://github.com/samtools/htslib/issues/578
>
> The fix for that was never put in place because we didn't manage to
> get a response back, but changing the probability of match vs N to
> 0.25 instead of 1 does indeed cure this problem.
>
> The cause of the wrong indel being considered as good is because the
> probaln_glocal function returns a negative score (it's meant to be
> phred-scaled, so that in theory is impossible).  This is due to a run
> of Ns in the reference, but I can't explain quite how that equates to
> negative logs!  (The alternative fix in bcftools is simply to turn
> negative scores into very high positive ones indicating a bad
> alignment.)
>
> Eg as a patch:
>
> *** bcftools-1.8/htslib-1.8/probaln.c   Mon Jul 31 09:15:20 2017
> --- ./probaln.c                         Thu May  3 12:22:04 2018
> ***************
> *** 134,140 ****
>           int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
>           for (k = beg, sum = 0.; k <= end; ++k) {
>               int u;
> !             double e = (ref[k - 1] > 3 || query[0] > 3)? 1. : ref[k - 1]
> == query[0]? 1. - qual[0] : qual[0] * EM;
>               set_u(u, bw, 1, k);
>               fi[u+0] = e * bM; fi[u+1] = EI * bI;
>               sum += fi[u] + fi[u+1];
> --- 134,140 ----
>           int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
>           for (k = beg, sum = 0.; k <= end; ++k) {
>               int u;
> !             double e = (ref[k - 1] > 3 || query[0] > 3)? .25 : ref[k -
> 1] == query[0]? 1. - qual[0] : qual[0] * EM;
>               set_u(u, bw, 1, k);
>               fi[u+0] = e * bM; fi[u+1] = EI * bI;
>               sum += fi[u] + fi[u+1];
> ***************
> *** 151,157 ****
>           double E[] = {
>               qli * EM, // 00
>               1. - qli, // 01
> !             1.,       // 10
>               1.,       // 11
>           };
>           double M = 1./s[i-1];
> --- 151,157 ----
>           double E[] = {
>               qli * EM, // 00
>               1. - qli, // 01
> !             .25,      // 10
>               1.,       // 11
>           };
>           double M = 1./s[i-1];
>
> There are still other issues with the code though, namely that it
> counts the total number of alignments having an indel rather than the
> number matching the reported indels.  This makes filtering
> challenging.
>
> So if we find 6 possible indels and emit 2 of them in the vcf file
> then IDV and IMF should be based on the number of records that confirm
> at least 1 of those 2, and not the other 4.  I'm trying to figure this
> out at the moment.
>
> I'm also doing a bcftools mpileup/call of an entire human chromosome
> 20 to see what differences the N tweak has.
>
> James
>
> --
> James Bonfield (j...@sanger.ac.uk)
> The Sanger Institute, Hinxton, Cambs, CB10 1SA
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>



-- 
Jody Phelan
64 Queen Alexandra Mansions,Hastings St
London, WC1H 9DR, UK
Tel: +447478609078
Email: jodyphe...@gmail.com
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to