Hi James,
Thank you very much for looking in to this. I will try apply your suggested
patch and see if it improves the calling.
In the short term I can try do some prefiltering on the BAM file to remove
reads with large indels to see if we see the same effect with smaller
indels.
I am quite interested to know the results from the human chromosome.
Please let me know if there is anything I can do from my side.
On Thu, May 3, 2018 at 12:31 PM, James Bonfield <j...@sanger.ac.uk> wrote:
> We think this is the same bug as
> https://github.com/samtools/htslib/issues/578
>
> The fix for that was never put in place because we didn't manage to
> get a response back, but changing the probability of match vs N to
> 0.25 instead of 1 does indeed cure this problem.
>
> The cause of the wrong indel being considered as good is because the
> probaln_glocal function returns a negative score (it's meant to be
> phred-scaled, so that in theory is impossible). This is due to a run
> of Ns in the reference, but I can't explain quite how that equates to
> negative logs! (The alternative fix in bcftools is simply to turn
> negative scores into very high positive ones indicating a bad
> alignment.)
>
> Eg as a patch:
>
> *** bcftools-1.8/htslib-1.8/probaln.c Mon Jul 31 09:15:20 2017
> --- ./probaln.c Thu May 3 12:22:04 2018
> ***************
> *** 134,140 ****
> int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
> for (k = beg, sum = 0.; k <= end; ++k) {
> int u;
> ! double e = (ref[k - 1] > 3 || query[0] > 3)? 1. : ref[k - 1]
> == query[0]? 1. - qual[0] : qual[0] * EM;
> set_u(u, bw, 1, k);
> fi[u+0] = e * bM; fi[u+1] = EI * bI;
> sum += fi[u] + fi[u+1];
> --- 134,140 ----
> int beg = 1, end = l_ref < bw + 1? l_ref : bw + 1;
> for (k = beg, sum = 0.; k <= end; ++k) {
> int u;
> ! double e = (ref[k - 1] > 3 || query[0] > 3)? .25 : ref[k -
> 1] == query[0]? 1. - qual[0] : qual[0] * EM;
> set_u(u, bw, 1, k);
> fi[u+0] = e * bM; fi[u+1] = EI * bI;
> sum += fi[u] + fi[u+1];
> ***************
> *** 151,157 ****
> double E[] = {
> qli * EM, // 00
> 1. - qli, // 01
> ! 1., // 10
> 1., // 11
> };
> double M = 1./s[i-1];
> --- 151,157 ----
> double E[] = {
> qli * EM, // 00
> 1. - qli, // 01
> ! .25, // 10
> 1., // 11
> };
> double M = 1./s[i-1];
>
> There are still other issues with the code though, namely that it
> counts the total number of alignments having an indel rather than the
> number matching the reported indels. This makes filtering
> challenging.
>
> So if we find 6 possible indels and emit 2 of them in the vcf file
> then IDV and IMF should be based on the number of records that confirm
> at least 1 of those 2, and not the other 4. I'm trying to figure this
> out at the moment.
>
> I'm also doing a bcftools mpileup/call of an entire human chromosome
> 20 to see what differences the N tweak has.
>
> James
>
> --
> James Bonfield (j...@sanger.ac.uk)
> The Sanger Institute, Hinxton, Cambs, CB10 1SA
>
>
> --
> The Wellcome Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
--
Jody Phelan
64 Queen Alexandra Mansions,Hastings St
London, WC1H 9DR, UK
Tel: +447478609078
Email: jodyphe...@gmail.com
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help