I just realized that I didn't reply-all when I answered Florian, so for the
sake of the list:

I already replied to this on the seqanswers thread. The MD tag doesn't
store insertions. It is reference-centric, so you'll only find information
on matches, mismatches, and deletions (versus the reference) in it. The two
1I operations in your CIGAR string therefore have no counterpart in the MD
tag, and the 51 bases you counted are exactly the reference bases the
alignment spans. Perhaps that footnote should be modified a bit.
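
To make that concrete, here is a minimal Python sketch of how the read,
the CIGAR string and the MD tag can be combined to rebuild the reference
segment. The function name reference_from_md is just something I made up
for illustration, and it assumes a well-formed CIGAR and MD pair; it is not
how samtools does this internally.

```python
import re

def reference_from_md(read, cigar, md):
    """Rebuild the reference segment covered by an alignment (illustrative sketch)."""
    # Step 1: keep only the read bases that are aligned to the reference.
    # M, = and X consume read and reference; I and S consume read only.
    aligned = []
    pos = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":
            aligned.append(read[pos:pos + length])
            pos += length
        elif op in "IS":
            pos += length  # inserted or soft-clipped bases are skipped
        # D, N, H and P consume no read bases
    aligned = "".join(aligned)

    # Step 2: walk the MD tag over the aligned bases.
    ref = []
    i = 0
    for count, mismatch, deletion in re.findall(r"(\d+)|([A-Z])|\^([A-Z]+)", md):
        if count:
            n = int(count)
            ref.append(aligned[i:i + n])  # matching bases are copied from the read
            i += n
        elif mismatch:
            ref.append(mismatch)          # MD stores the reference base
            i += 1
        else:
            ref.append(deletion)          # deleted bases exist only in the reference
    return "".join(ref)

read = "GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG"
reference = reference_from_md(read, "32M1I5M1I10M1D3M", "2A11G14G7G9^C3")
print(reference, len(reference))  # 51 reference bases
```

For Florian's example this gives the 51-base reference, i.e. his "true
reference" line with the two gap characters removed.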
Best,
Devon
--
Devon Ryan, Ph.D.
Email: dpr...@dpryan.com
Laboratory for Molecular and Cellular Cognition
German Centre for Neurodegenerative Diseases (DZNE)
Ludwig-Erhard-Allee 2
53175 Bonn
Germany
<devon.r...@dzne.de>
On Wed, Oct 22, 2014 at 9:33 AM, Florian Aldehoff <aldeh...@web.de> wrote:
> Hello,
>
> I hope this list is the right place for my question regarding the SAM
> file format. I have also posted it to
>
> http://seqanswers.com/forums/showthread.php?t=47637
>
> just in case. Please point me to better sites or mailing lists if this
> does not belong here.
>
> I'm trying to wrap my head around the optional MD tag in SAM files
> because a tool in my processing pipeline relies on this tag. In theory
> it should allow me to call SNPs/indels without looking up the reference
> sequence for a read. An example MD tag from a file I'm dealing with is
>
> MD:Z:2A11G14G7G9^C3
>
> read: GAGGAACCTTACCAAGGCTTGACATGTAGCTGCAAGCGCACGGAAACGTGTG
> CIGAR: 32M1I5M1I10M1D3M
>
> Now while the sum of the CIGAR M/I/S/=/X operations correctly equals the
> length of the read (52 bases, 53 when also considering the deletion/gap
> at position 50), I only get to 51 reference bases when I attempt to
> (manually) reconstruct the reference from the MD tag alone:
>
> 2A11G14G7G9^C3
>
> in a "decompressed" form becomes
>
> ==A===========G==============G=======G=========-===
>
> becomes the following reference sequence (first line) as compared to the
> true reference sequence (second line):
>
> GAAGAACCTTACCAGGGCTTGACATGTAGGTGCAAGCGCACGGAAACCTGT
> GAAGAACCTTACCAGGGCTTGACATGTAGGTG-AAGCG-GCGGAAACGTCGTG
>
> The difference in length as well as the shift in the sequence both seem
> to arise from the lack of a notation for the two gaps in the reference
> (positions 33 and 39).
>
> Now, am I just misunderstanding the MD tag? Do I always have to consider
> both the CIGAR string AND the MD tag to infer the reference sequence? Or
> is there a notation for gaps in the reference that I simply have
> overlooked in the SAM specification? What I've found so far is the regex
> of permitted characters on page 6:
>
> [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
>
> and the footnote on page 7 claiming that the MD field ought to match the
> CIGAR string (which it obviously doesn't in my example).
>
> Thank you a lot for any insight and clarification!
>
> Florian Aldehoff
>
>
> ------------------------------------------------------------------------------
> Comprehensive Server Monitoring with Site24x7.
> Monitor 10 servers for $9/Month.
> Get alerted through email, SMS, voice calls or mobile push notifications.
> Take corrective actions from your mobile device.
> http://p.sf.net/sfu/Zoho
> _______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help
>