On 22 Jul 2016, at 21:49, Annie Cowell <annie.cow...@gmail.com> wrote:
> samtools mpileup -C50 -Bug -t AD -Q10 -f my.fasta sample.bam | bcftools call -mv -Ov > my.vcf
>
> However, I am getting unusual bases for a couple of indels in the reference
> field, such as W (as below) and R.
>
> AAKM01000065 8883 . GGW G 228 . INDEL;IDV=40;IMF=0.888889;DP=45;VDB=0.880501;SGB=-0.692976;MQSB=0.114162;MQ0F=0;AC=2;AN=2;DP4=0,0,8,18;MQ=49 GT:PL:AD 1/1:255,78,0:0,26

As Thomas guessed, these are indeed ambiguity codes coming from the reference genome. From http://www.ncbi.nlm.nih.gov/nuccore/AAKM01000065 we can see the GGW bases at location 8883 onwards:

     8881 aaggwtataa agaacgcata tatgcctttt gtcccctgcc cgcgtctgga tactcgctac

The issue of IUPAC ambiguity codes in VCF REF fields came up a while ago [1], and the consensus expressed in the VCF v4.3 spec is that they shouldn't appear; instead they should be simplified to a non-ambiguous base, in this case GGA.

So this is an mpileup and/or bcftools bug, and apparently it should be suppressing ambiguity codes by default.

    John

[1] https://github.com/samtools/hts-specs/issues/54

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
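
For illustration, the simplification of ambiguity codes mentioned above (GGW becoming GGA) can be sketched in a few lines of Python. This is only a sketch, not what mpileup or bcftools actually does: the function name `reduce_ambiguity_codes` is hypothetical, and the rule of choosing the alphabetically first base that each code can stand for is my reading of the VCF v4.3 spec, so please check it against the spec before relying on it.

```python
# Hypothetical helper, not part of samtools/bcftools.
# Assumption: each IUPAC ambiguity code is reduced to the alphabetically
# first base it can represent (my reading of the VCF v4.3 spec).
IUPAC_FIRST_BASE = {
    "R": "A",  # A or G
    "Y": "C",  # C or T
    "S": "C",  # C or G
    "W": "A",  # A or T
    "K": "G",  # G or T
    "M": "A",  # A or C
    "B": "C",  # C, G or T
    "D": "A",  # A, G or T
    "H": "A",  # A, C or T
    "V": "A",  # A, C or G
}

# Build a translation table covering upper and lower case, so that the
# case of the input sequence is preserved. A, C, G, T and N pass through.
_TABLE = str.maketrans(
    {code: base for code, base in IUPAC_FIRST_BASE.items()}
    | {code.lower(): base.lower() for code, base in IUPAC_FIRST_BASE.items()}
)


def reduce_ambiguity_codes(ref: str) -> str:
    """Return ref with IUPAC ambiguity codes replaced by a concrete base."""
    return ref.translate(_TABLE)


print(reduce_ambiguity_codes("GGW"))  # GGA
```

Applied to the REF column of the record quoted above, this would turn GGW into GGA and leave the ALT allele G untouched.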