On Tue, Mar 06, 2018 at 09:57:25PM +1100, Jing meng wrote:
> at the read base column, a symbol `^' marks the start of a read segment
> which is a contiguous subsequence on the read separated by `N/S/H' CIGAR
> operations. The ASCII of the character following `^' minus 33 gives the
> mapping quality.
> The base that is the start of a read segment which is a contiguous
> subsequence on the read separated by `N/S/H' CIGAR operations is before the
> symbol '^' or after the symbol and the ASCII?
It's after. This is trivially visible if you do a samtools mpileup of
a file. Eg the first line in my test file shows:
1 10001 N 1 ^#T A
(By coincidence, the "A" in the last column is a quality value here, not a
base call. The base is the "T" that follows "^#".)
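To make the ordering concrete, here is a minimal Python sketch that decodes the start-of-segment marker from the read bases column. The function name read_start_mapq is made up for illustration and is not part of samtools. It assumes every "^" is followed by one mapping quality character and then the base itself.

```python
def read_start_mapq(bases):
    """Return a list of (mapq, base) for each read segment starting in this column."""
    starts = []
    i = 0
    while i < len(bases):
        if bases[i] == "^":
            # The character after '^' encodes MAPQ as ASCII minus 33;
            # the base of the read comes after that character.
            mapq = ord(bases[i + 1]) - 33
            starts.append((mapq, bases[i + 2]))
            i += 3
        else:
            i += 1
    return starts


print(read_start_mapq("^#T"))  # [(2, 'T')]
```

"#" is ASCII 35, so the mapping quality here is 35 - 33 = 2.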
> A symbol `$' marks the end of a read segment.
> The base that is the end of a read segment is before or after the symbol
In this case it's before. Again, just look at the end of an mpileup
output for example (using -r region to speed it up helps). Or just
make a one-read SAM file, e.g.:
foo 99 1 10001 2 10M1I10M1D10M = 10028 126
which leads to (no ref specified so "N"):
1 10001 N 1 ^#A 0
1 10031 N 1 T$ 2
I think the logic here is that ^ and $ are the regexp start-of-line and
end-of-line markers, used here for consistency: start means before the
data and end means after it.
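As a rough sketch of that "before the data / after the data" rule, the Python below walks a read bases column and flags each base as a segment start and/or end. The function name segment_flags is hypothetical, and the sketch assumes the column holds only bases plus "^" and "$" markup (no indel markup).

```python
def segment_flags(bases):
    """Return one (base, is_start, is_end) tuple per read in the column."""
    out = []
    i = 0
    while i < len(bases):
        is_start = False
        if bases[i] == "^":
            # '^' and the MAPQ character come BEFORE the base.
            is_start = True
            i += 2
        base = bases[i]
        i += 1
        # '$' comes AFTER the base it belongs to.
        is_end = i < len(bases) and bases[i] == "$"
        if is_end:
            i += 1
        out.append((base, is_start, is_end))
    return out


print(segment_flags("^#A"))  # [('A', True, False)]
print(segment_flags("T$"))   # [('T', False, True)]
```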
> For the insertions in the column of read bases, are the qualities of the
> inserted bases given in the next column? What is the base of the read with
> a insertion in the reference position? Where is the base quality of the
> anchor base?
Insertions are between reference coordinates, so they appear after the
mapped base and before the next one (which is in the next line of
output). The qualities for insertions don't appear anywhere as there
is no +/- markup in the quality field.
What is the anchor base? I assume it's the one to the left of the
insertion? If so, that's the mapped base, and its quality appears in the
quality field as normal.
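One way to check this for yourself is to strip all markup from the bases column and compare what is left with the quality column: there should be one character per read in each. The Python below is a sketch under that assumption. The name strip_markup and the example column "A+2GT" are made up for illustration, and only the markup discussed in this thread is handled.

```python
import re


def strip_markup(bases):
    """Remove ^X, $ and +N.../-N... markup, leaving one character per read."""
    out = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # '^' plus the MAPQ character
        elif c == "$":
            i += 1
        elif c in "+-":
            # Indel markup: sign, decimal length N, then N bases.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(digits) + int(digits)
        else:
            out.append(c)
            i += 1
    return "".join(out)


# Hypothetical column: one read with base A followed by a 2 bp insertion.
print(strip_markup("A+2GT"))  # A
```

The quality column for that line would hold a single character, the quality of the mapped base A. Nothing is emitted for the inserted G and T.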
> For the deletions in the column of read bases, what is the base of the
> read with a deletion in the reference position? Where is the base quality
> of the anchor base?
> I couldn't find the answers for the above questions. Could you please help
> me with these problems? Thank you for your time!
This one is a bit less obvious. The base for the deletion comes from
the reference sequence, so N if unspecified, otherwise whatever the
reference has at that coordinate. This is listed after the mapped
base, so pos 10, 1M 1D 1M, for seq AT and ref ACT, would be eg
10 A-1C (1M 1D; C from ref)
11 * (1D)
12 T (1M)
The quality for the deletion is fake as it doesn't actually exist in
the called sequence. However, the format requires a base & quality, so
it uses base * and the quality of the next base instead.
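The example above can be reproduced with a toy Python generator. This is only a sketch of the layout rules described here, not how samtools implements mpileup. The name toy_pileup is hypothetical. It assumes a single forward-strand read whose CIGAR starts with M and contains only M, I and D, and it omits the "^" and "$" markers, as the example above does.

```python
import re


def toy_pileup(pos, cigar, seq, ref):
    """Return {ref_pos: bases_string} for one read; ref[0] sits at `pos`."""
    cols = {}
    rpos, qpos, last = pos, 0, None
    for n, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(n)
        if op == "M":
            for _ in range(n):
                cols[rpos] = seq[qpos]
                last = rpos
                rpos += 1
                qpos += 1
        elif op == "D":
            # Deleted bases come from the reference and are listed
            # after the last mapped base ...
            deleted = ref[rpos - pos:rpos - pos + n]
            cols[last] += "-%d%s" % (n, deleted)
            # ... and each deleted reference position gets a '*'.
            for _ in range(n):
                cols[rpos] = "*"
                rpos += 1
        elif op == "I":
            # Inserted bases come from the read and consume no reference.
            cols[last] += "+%d%s" % (n, seq[qpos:qpos + n])
            qpos += n
    return cols


print(toy_pileup(10, "1M1D1M", "AT", "ACT"))
# {10: 'A-1C', 11: '*', 12: 'T'}
```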
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.