Urs -Just guessing, since I haven't actually read this part of the samtools code. It's likely that the model used for genotype calling with mutliple samples is that these samples are a sample of unrelated individuals from a single, homogeneous, randomly mating population. As a consequence, Hardy-Weinberg proportions for genotypes at each locus are expected, and it makes sense to estimate the population allele frequency from the sample provided and to use an allele frequency prior for genotype calling. This will strongly downweight hom-alt genotype calls at a site (like this one) with very low estimated allele frequency.
It's a bit clunky, but here's what I would do in order to avoid this behavior. I would make a file showing the list of all sites where a variant was called in the multi-sample calling that you've already done. Then I would re-run mpileup on each sample one at a time, supplying the site list with argument '-l' to get single sample genotype calls. Then merge all 13 single sample .bcf or .vcf files using bcftools. When calling a single sample showing an apparent hom-alt genotype, the apparent allele frequency will be 1.0 and there will be strong SUPPORT for the hom-alt genotype, rather than population evidence against it.
That said, I not recommend doing analysis using the discrete genotype calls from a region with only 3 or 4 covering reads. Analysis should be done using the genotype likelihoods themselves.
- tom blackwell - On Mon, 8 May 2017, urs.lahrm...@item.fraunhofer.de wrote:
Hi, I have used mpileup on a set of 13 samples and find occasionally a predicted GT that imo does not fit to the provided PL's. So far I have seen this only in low coverage regions and I haven't analysed it systematically. Nevertheless I would like to understand what's going on or if I miss something. An example of my data: chr1 19202896 0 G A 65 PASS DP=36;VDB=0.820274;SGB=6.18608;RPB=0.894839;MQB=1;MQSB=1;BQB=0.972604;MQ0F=0;ICB=0.0072327;HOB=0.00347222;AC=1;AN=24;DP4=14,13,0,4;MQ=60 GT:PL:DP:ADF:ADR:AD 0/0:0,3,60:1:1,0:0,0:1,0 0/0:0,3,33:1:0,0:1,0:1,0 0/0:0,12,182:4:3,0:1,0:4,0 0/0:0,3,37:1:1,0:0,0:1,0 0/0:0,6,64:2:2,0:0,0:2,0 ./.:0,0,0:0:0,0:0,0:0,0 ./.:0,0,0:0:0,0:0,0:0,0 0/0:0,12,135:4:2,0:2,0:4,0 0/0:0,6,67:2:2,0:0,0:2,0 0/0:0,9,78:3:0,0:3,0:3,0 0/0:0,9,76:3:0,0:3,0:3,0 0/0:0,12,112:4:2,0:2,0:4,0 0/1:113,12,0:4:0,0:0,4:0,4 0/0:0,6,97:2:1,0:1,0:2,0 So, the second before last GT ist het (0/1), but PL is "113, 12, 0" and with that should be 1/1?! The only explanation I currently have is: a present, yet not sequenced, reference allele is implied because of other samples are hom ref. If this is true, please tell me which parameter turns it off as it would be stupid to use such a setting with my current sample collective. If it is something else, please enlighten me :) Best regards, Urs
------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help