Hi! I'm getting VCF files with some strange INDELS from bcftools. The shell 
script below contains everything needed to reproduce it (I'm using bwa 
0.7.10-r789, samtools 1.1, and bcftools 1.1) and it outputs the following:

Lines for position 120 when the VCF is generated by "samtools mpileup":

MyRef   120     .       G       <X>     0       .       
DP=3;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0  PL      0,0,0
MyRef   120     .       G       GGAAT   0       .       
INDEL;IDV=2;IMF=0.666667;DP=3;I16=0,0,1,0,0,0,30,900,0,0,60,3600,0,0,25,625;QS=0,1;SGB=-0.379885;MQ0F=0
      PL      30,3,0

Lines for position 120 when the VCF is generated by "bcftools call":

MyRef   120     .       G       .       0       .       
DP=3;MQ0F=0;AN=0;DP4=0,0,0,0;MQ=.       GT      .
MyRef   120     .       G       .       2.1484  .       
INDEL;IDV=2;IMF=0.666667;DP=3;SGB=-0.379885;MQ0F=0;AN=1;DP4=0,0,1,0;MQ=60       
GT      0

Output from "samtools mpileup" around position 120:

MyRef   110     G       2       .,      mD
MyRef   111     T       2       .,      bD
MyRef   112     G       2       .,      eD
MyRef   113     T       2       .,      jD
MyRef   114     T       2       C,      kD
MyRef   115     G       1       .       m
MyRef   116     G       1       .       e
MyRef   117     G       1       .       j
MyRef   118     G       1       .       i
MyRef   119     A       1       .       k
MyRef   120     G       0
MyRef   121     A       0
MyRef   122     C       1       .       I
MyRef   123     T       1       .       k
MyRef   124     C       2       .,      gA
MyRef   125     A       2       Gg      dA
MyRef   126     G       2       .,      _A
MyRef   127     T       2       .,      iF
MyRef   128     G       2       .,      fH
MyRef   129     C       2       .,      jJ
MyRef   130     C       2       .,      iI

It's the output from "bcftools call" that has ALT="." while the "INDEL" flag is 
present that worries me. Is this a bug, or just something I don't understand 
(yet)?

The shell script to reproduce the above output:

#!/bin/bash

# Make directory for data
mkdir -p tmp-data
cd tmp-data

# Redirect stderr.
exec 2>stderr

# Use this short refence.
cat >ref <<EOF
>MyRef
GGTGGCGAAGGCGGCTTGCTGGACAAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTGGGGAG
ACTCAGTGCCGCAGCTAACGCAATAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGA
EOF

# Use these reads.
cat >reads.fastq <<EOF
@Instrument:1:FlowCell:1:1:1:1/1
ACCTTGCGGTCGTACTCCCCAGGTGGAGTGCTTATTGCGTTTGCTGCGGCACCGACCATCTCTGGCCAACACCTAGCACTCATCGTTTACGGCGTGGA
+
CCCFFFFFHFHHGJJJJ3EHGIJ?FHDHFHJJIJJJJJJHIJJJJJJIJHFFDDDDDDDDDDEDCDDDDDDDDDDCCCCDDDDDDBDDDDDDDDDDDD
@Instrument:1:FlowCell:1:1:1:1/2
GGTGGCGAAGGCGGCTTACTGGACTGTTACTGACGCTGAGTCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACG
+
CCBFFFFFHHHHHJJJJJJIJJIIJJJIJJJJJIJJJIJJHHHHHFFFDE9=BDDDDDDDDDDDDDDDDDDDDEDDDDDDADDDEDDDBDDDDDDDD<
@Instrument:1:FlowCell:1:2:2:2/1
TCAACCTTGCGGTCGTACTCCCCAGGTGGAGTGCTTATTGCGTTAGCTGCGGCACCGAGGATTCCTCCCCGACACCTAGCACTCATCGTTTACGGCG
+
BCCFFFFFHHHGFHIJJGHIIIIHDH?GDFI*BGIHIIJGEGAEEHIHJIGHFFB?@DDBBCDCDCCDBDDDBBDB9CCDDD@?CDDDDDD?@@DBB
@Instrument:1:FlowCell:1:2:2:2/2
GGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTCGGGGAGGAATCCTCGGTGCCGCAGCTAACGCAATAAGCACTCCACCTGGGGAGTACGACCGC
+
BBBDFDF?FHHHFHHHIJIJJJIHIHGJIGIJADGHJDGGIGGGFHHHHHF>DACCDBDBDDDDDBDD@DDCDDDDDDDCDDDCBBB@DDCDDDBDBB
EOF

# Prepare indices of the reference.
samtools faidx ref
bwa index ref

# Align the reads to the reference.
bwa mem -p -t 4 -R '@RG\tID:MySample\tSM:MySample' ref reads.fastq > 
alignment.sam

# bwa-mem needs to infer the insert size distribution from data. You
# have to mix the read pair with at least tens of other pairs. For
# this reason bwa doesn't think the reads are properly paired, so we
# set the flags manually.
perl -pe '@_=split /\t/; if ($_[0] !~ /^@/) { @_[1] |= 2; $_=join("\t", @_) }' 
alignment.sam > fixed_alignment.sam

# Sort the alignment.
samtools sort -T samtools.sort fixed_alignment.sam -O bam > sorted_alignment.bam

# Index the sorted alignment.
samtools index sorted_alignment.bam

# Make a file containing the ploidy of the sample (for "bcftools call -m").
echo -e >sample.tab "MySample\t1"

# Generate vcf using samtools.
samtools mpileup -r MyRef:110-130 -vuf ref sorted_alignment.bam > 
pos110-130.samtools.vcf

# Generate vcf using bcftools.
samtools mpileup -r MyRef:110-130 -uf ref sorted_alignment.bam | bcftools call 
-m -S sample.tab -O v > pos110-130.bcftools.vcf

# Generate mpileup data.
samtools mpileup -r MyRef:110-130 -f ref sorted_alignment.bam > mpileup.tab

# Output results.
echo -e "\nLines for position 120 when the VCF is generated by \"samtools 
mpileup\":\n"
grep -P '^MyRef\t120' pos110-130.samtools.vcf
echo -e "\nLines for position 120 when the VCF is generated by \"bcftools 
call\":\n"
grep -P '^MyRef\t120' pos110-130.bcftools.vcf
echo -e "\nOutput from \"samtools mpileup\" around position 120:\n"
cat mpileup.tab
echo

### END OF SCRIPT


Best,

Rasmus Borup Hansen

Intomics is a contract research organization specialized in deriving core 
biological insight from large scale data. We help our clients in the 
pharmaceutical industry develop tomorrow's medicines better, faster, and 
cheaper through optimized use of biomedical data.
-----------------------------------------------------------------
Hansen, Rasmus Borup              Intomics - from data to biology
System Administrator              Diplomvej 377
Scientific Programmer             DK-2800 Kgs. Lyngby
                                  Denmark
E: r...@intomics.com               W: http://www.intomics.com/
P: +45 5167 7972                  P: +45 8880 7979

------------------------------------------------------------------------------
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to