On Tue, Mar 06, 2018 at 09:57:25PM +1100, Jing meng wrote:
> 1.
>
> at the read base column, a symbol `^' marks the start of a read segment
> which is a contiguous subsequence on the read separated by `N/S/H' CIGAR
> operations. The ASCII of the character following `^' minus 33 gives the
> mapping quality.
>
> The base that is the start of a read segment which is a contiguous
> subsequence on the read separated by `N/S/H' CIGAR operations is before the
> symbol '^' or after the symbol and the ASCII?

It's after. This is easy to see if you run samtools mpileup on a file. E.g.
the first line of output for my test file is:

    1  10001  N  1  ^#T  A

(By luck, "A" happens to be a quality value here, not a base call.)

> 2.
> A symbol `$' marks the end of a read segment.
>
> The base that is the end of a read segment is before or after the symbol
> '$'?

In this case it's before. Again, just look at the end of an mpileup output,
for example (using -r region to speed it up helps). Or just make a one-read
SAM file, e.g.:

    foo 99 1 10001 2 10M1I10M1D10M = 10028 126 AAAAAAAAAACGGGGGGGGGGTTTTTTTTTT 0BBBBBBBB0I1BBBBBBBB12BBBBBBBB2

which leads to (no ref specified, so "N"):

    1  10001  N  1  ^#A  0
    ...
    1  10031  N  1  T$   2

I think the logic here is that ^ and $ are the regexp start-of-line and
end-of-line markers, used here for consistency. Start and end mean before
the data and after the data.

> 3.
>
> For the insertions in the column of read bases, are the qualities of the
> inserted bases given in the next column? What is the base of the read with
> a insertion in the reference position? Where is the base quality of the
> anchor base?

Insertions are between reference coordinates, so they appear after the
mapped base and before the next one (which is in the next line of output).
The qualities for insertions don't appear anywhere, as there is no +/-
markup in the quality field.

What is the anchor base? I assume it's the one to the left of the
insertion? If so, that's the mapped base, and its quality appears in the
quality field as normal.

> 4.
>
> For the deletions in the column of read bases, what is the base of the
> read with a deletion in the reference position? Where is the base quality
> of the anchor base?
>
> I couldn't find the answers for the above questions. Could you please help
> me with these problems? Thank you for your time!

This one is a bit less obvious. The base for the deletion comes from the
reference sequence, so N if unspecified, otherwise whatever the reference
has at that coordinate.
This is listed after the mapped base, so at pos 10, 1M 1D 1M, for seq AT and
ref ACT, the output would be e.g.:

    10  A-1C  (1M 1D; C from ref)
    11  *     (1D)
    12  T     (1M)

The quality for the deletion is fake, as the deleted base doesn't actually
exist in the called sequence. However, the format requires a base and a
quality, so it uses base * and the quality of the next base instead.

James

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
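
The rules discussed in this message (the base comes after `^' plus its
mapping-quality character, the base comes before `$', and indel markup
follows the mapped base without any quality of its own) can be sketched as
a small parser for the read-bases column. This is only a minimal Python
sketch and not samtools code: the function name parse_read_bases and the
record fields are made up for illustration, and it assumes a well-formed
column.

```python
import re


def parse_read_bases(bases):
    """Split a pileup read-bases string into one record per read.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    reads = []
    pending_mapq = None  # set when '^' is seen, attached to the next base
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # '^' and one mapping-quality character come BEFORE the base.
            pending_mapq = ord(bases[i + 1]) - 33
            i += 2
        elif c == "$":
            # '$' comes AFTER the last base of the read segment.
            reads[-1]["end"] = True
            i += 1
        elif c in "+-":
            # Indel markup follows the mapped base: sign, length, sequence.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            start = i + 1 + len(digits)
            length = int(digits)
            reads[-1]["indel"] = c + bases[start:start + length]
            i = start + length
        else:
            # A base call, a match ('.' or ','), or the '*' placeholder.
            reads.append(
                {"base": c, "start_mapq": pending_mapq, "end": False, "indel": None}
            )
            pending_mapq = None
            i += 1
    return reads


print(parse_read_bases("^#T")[0]["start_mapq"])  # 2
print(parse_read_bases("A-1C")[0]["indel"])      # -C
```

Because `^', `$' and the +/- markup never produce a record of their own,
the number of records returned should equal the number of characters in the
quality column of the same line.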