Hello,

I hope this list is the right place for my question regarding the SAM
file format. I have also posted it to

http://seqanswers.com/forums/showthread.php?t=47637

just in case. Please point me to better sites or mailing lists if this
does not belong here.

I'm trying to wrap my head around the optional MD tag in SAM files
because a tool in my processing pipeline relies on this tag. In theory
it should allow me to call SNPs/indels without looking up the reference
sequence for a read. An example MD tag from a file I'm dealing with is

MD:Z:2A11G14G7G9^C3

read: GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG
CIGAR: 32M1I5M1I10M1D3M

Now while the sum of the CIGAR M/I/S/=/X operations correctly equals the
length of the read (52 bases, 53 when also considering the deletion/gap
at position 50), I only get to 51 reference bases when I attempt to
(manually) reconstruct the reference from the MD tag alone:

2A11G14G7G9^C3

in a "decompressed" form becomes

==A===========G==============G=======G=========-===

which yields the following reference sequence (first line), shown next to
the true reference sequence (second line):

GAAGAACCTTACCAGGGCTTGACATGTAGGTGCAAGCGCACGGAAACCTGT
GAAGAACCTTACCAGGGCTTGACATGTAGGTG-AAGCG-GCGGAAACGTCGTG
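
The manual "decompression" of the MD value above can be sketched in a few
lines of Python. This is only an illustration: the names MD_TOKEN and
expand_md are made up for this post and are not part of samtools or any
library. A deleted reference base is shown as "-", as in the mask above.

```python
import re

# One token is a run of matches, a deletion (^BASES) or a single mismatch.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")

def expand_md(md):
    """Expand an MD value into one character per reference base.

    '=' marks a base equal to the read, a letter is the reference base at
    a mismatch, and '-' marks a reference base deleted from the read.
    """
    out = []
    for run, deleted, mismatch in MD_TOKEN.findall(md):
        if run:
            out.append("=" * int(run))
        elif deleted:
            out.append("-" * len(deleted))
        else:
            out.append(mismatch)
    return "".join(out)

mask = expand_md("2A11G14G7G9^C3")
print(mask)
print(len(mask))  # 51
```

The expansion has one character per reference base that the MD tag
describes, so its length is the 51 bases mentioned above.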

The difference in length as well as the shift in the sequence both seem
to arise from the lack of a notation for the two gaps in the reference
(positions 33 and 39).
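
For comparison, here is a minimal sketch that also consults the CIGAR
string while walking the read. It rests on an assumption that is not
confirmed here: that MD only describes reference bases under M/=/X and D
operations and carries no information about inserted read bases. The
function name reference_from_read and the gap parameter are hypothetical,
and N (skipped region) operations are deliberately ignored.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")

def reference_from_read(read, cigar, md, gap="-"):
    """Rebuild the aligned reference from read, CIGAR and MD.

    Insertions in the read are written as `gap` so the result lines up
    with the read; pass gap="" to get the bare reference bases.
    """
    # One entry per reference base: None means "same as the read".
    md_bases = []
    for run, deleted, mismatch in MD_TOKEN.findall(md):
        if run:
            md_bases.extend([None] * int(run))
        elif deleted:
            md_bases.extend(deleted)
        else:
            md_bases.append(mismatch)

    ref = []
    read_pos = 0
    md_pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            # Aligned bases: copy the read unless MD records a mismatch.
            for _ in range(length):
                base = md_bases[md_pos]
                ref.append(read[read_pos] if base is None else base)
                read_pos += 1
                md_pos += 1
        elif op == "I":
            # Extra read bases with no reference counterpart.
            ref.append(gap * length)
            read_pos += length
        elif op == "S":
            # Soft-clipped read bases are skipped.
            read_pos += length
        elif op == "D":
            # Deleted reference bases come from the ^... part of MD.
            ref.extend(md_bases[md_pos:md_pos + length])
            md_pos += length
        # H and P consume neither sequence; N is not handled in this sketch.
    return "".join(ref)

read = "GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG"
gapped = reference_from_read(read, "32M1I5M1I10M1D3M", "2A11G14G7G9^C3")
print(gapped)
```

With the read, CIGAR and MD values from this post, the sketch should
reproduce the gapped true reference line shown above, with "-" at the two
inserted read bases.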

Now, am I just misunderstanding the MD tag? Do I always have to consider
both the CIGAR string AND the MD tag to infer the reference sequence? Or
is there a notation for gaps in the reference that I simply have
overlooked in the SAM specification? What I've found so far is the regex
of permitted characters on page 6:

[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*

and the footnote on page 7 claiming that the MD field ought to match the
CIGAR string (which it obviously doesn't in my example).
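
To make that comparison concrete, the following sketch counts the bases
each field accounts for and checks the MD value against the regex quoted
above. The helper names cigar_lengths and md_length are hypothetical, and
counting only M/=/X/D on the CIGAR side as "covered by MD" is an assumption
on my part.

```python
import re

# The pattern quoted from the SAM specification above.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_lengths(cigar):
    """Return (read bases, reference bases assumed to be covered by MD)."""
    read_len = 0
    ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        count = int(count)
        if op in "MIS=X":   # operations that consume read bases
            read_len += count
        if op in "MD=X":    # operations assumed to appear in MD
            ref_len += count
    return read_len, ref_len

def md_length(md):
    """Return the number of reference bases an MD value describes."""
    if MD_PATTERN.fullmatch(md) is None:
        raise ValueError("MD value does not match the pattern: " + md)
    total = 0
    for run, deleted, mismatch in MD_TOKEN.findall(md):
        if run:
            total += int(run)
        elif deleted:
            total += len(deleted)
        else:
            total += 1
    return total

print(cigar_lengths("32M1I5M1I10M1D3M"))  # (52, 51)
print(md_length("2A11G14G7G9^C3"))        # 51
```

Under these assumptions the CIGAR string accounts for 52 read bases and 51
reference bases, and the MD value also describes 51 reference bases.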

Thank you a lot for any insight and clarification!

Florian Aldehoff

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
