On Wed, May 10, 2017 at 11:34:46AM -0600, Brent Pedersen wrote:
> I have a .crai with a negative alignment span. The row looks like this:
> 
> 2225    1       -2147483648     14896634174     936     628560
> 
> I'm wondering what that means. It was created with samtools 1.3.1

It means there is a bug, I suspect!  Thanks for raising this.

The format fields are described in the document, as you saw, and can
also be seen in the code at
https://github.com/samtools/htslib/blob/develop/cram/cram_index.c#L568

The fields are: ref seq number (aka "tid" in BAM-world), ref seq start,
span (start+span-1 == end), file offset of the container start, offset
of the slice within the container, and slice size.  Together these
permit random access from any given genome coordinate.
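
To make the field order concrete, here is a minimal sketch of reading one
index row and deriving the end coordinate.  It assumes the .crai has
already been decompressed to plain text (the index is gzip-compressed);
the names CraiEntry and parse_crai_line are made up for illustration and
are not part of htslib.

```python
from typing import NamedTuple


class CraiEntry(NamedTuple):
    """One row of a decompressed .crai index (hypothetical helper type)."""
    ref_id: int            # ref seq number ("tid" in BAM-world)
    start: int             # ref seq start
    span: int              # alignment span
    container_offset: int  # file offset of container start
    slice_offset: int      # offset of slice within the container
    slice_size: int        # slice size

    @property
    def end(self) -> int:
        # start + span - 1 == end, as described above
        return self.start + self.span - 1


def parse_crai_line(line: str) -> CraiEntry:
    """Split a whitespace-separated index row into its six integer fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return CraiEntry(*(int(f) for f in fields))


entry = parse_crai_line("35      88      36030   3185574033      1205    15021")
print(entry.ref_id, entry.start, entry.end)  # 35 88 36117
```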

-2147483648 is -2^31, also "INT_MIN" in C.  This occurs in the
cram_index_build_multiref() function which deals with indexing slices
where multiple references occur within that same container.  That
method was added for handling excessively fragmented assemblies
(leading to potentially millions of containers), but is also used for
packing the tiny references together at the end of many human aligned
files.  Were there several references around 2225 that also shared
14896634174 as the container offset?
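
One way to answer that question is to scan the decompressed index for
every row at the same container offset and flag any INT_MIN span.  This
is only a sketch: rows_sharing_container is a hypothetical helper, and
the input here is just the single row from the report, so a real check
would feed it all the lines of the index.

```python
INT_MIN = -2**31  # -2147483648, the smallest 32-bit signed integer


def rows_sharing_container(lines, container_offset):
    """Return (ref_id, start, span) for each index row at container_offset."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        ref_id, start, span, offset, _slice_off, _slice_size = map(int, fields)
        if offset == container_offset:
            hits.append((ref_id, start, span))
    return hits


# Only the row quoted in the report; replace with every line of the index.
reported = ["2225    1       -2147483648     14896634174     936     628560"]

for ref_id, start, span in rows_sharing_container(reported, 14896634174):
    flag = "INT_MIN span" if span == INT_MIN else "ok"
    print(ref_id, start, span, flag)
```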

It's unexpected to see INT_MIN make it through that code and out the
other side though!  I'll need to study the code and work out how it
happened.  It looks like it's somehow got references that aren't in
the headers, but I don't see how that can happen.  If you have public
test data that causes this then it would be useful. If not, could you
tell me what the alignments are on this reference?  I'm wondering if
it could happen if we have an unmapped but placed read as the only
read aligned to a reference.  I'll do some tests...

James

PS. Normally for multi-ref containers I expect to see "span" filled
out correctly.  E.g. in this example where multiple references share the
same container/slice offset:

35      88      36030   3185574033      1205    15021
36      43      37439   3185590285      1044    63977
37      176     37857   3185590285      1044    63977
38      125     20358   3185590285      1044    63977
38      20404   18061   3185655334      1135    65872
39      103     38695   3185655334      1135    65872
40      175     39568   3185655334      1135    65872
41      7       6593    3185655334      1135    65872
41      6613    33242   3185722369      977     68659
42      40      39217   3185722369      977     68659
43      74      12165   3185722369      977     68659
43      12207   27760   3185792033      929     66582
44      60      19052   3185792033      929     66582
44      19131   21364   3185859572      1138    65580
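
As a rough illustration of reading the rows above, the sketch below
groups them by container offset and computes each end coordinate as
start+span-1.  The helper name refs_by_container is made up for this
example; the data is copied from the listing.

```python
from collections import defaultdict

EXAMPLE = """\
35      88      36030   3185574033      1205    15021
36      43      37439   3185590285      1044    63977
37      176     37857   3185590285      1044    63977
38      125     20358   3185590285      1044    63977
38      20404   18061   3185655334      1135    65872
39      103     38695   3185655334      1135    65872
40      175     39568   3185655334      1135    65872
41      7       6593    3185655334      1135    65872
41      6613    33242   3185722369      977     68659
42      40      39217   3185722369      977     68659
43      74      12165   3185722369      977     68659
43      12207   27760   3185792033      929     66582
44      60      19052   3185792033      929     66582
44      19131   21364   3185859572      1138    65580
"""


def refs_by_container(text):
    """Map container offset -> list of (ref_id, start, end) rows."""
    groups = defaultdict(list)
    for line in text.splitlines():
        ref_id, start, span, offset, _slice_off, _slice_size = map(int, line.split())
        groups[offset].append((ref_id, start, start + span - 1))
    return dict(groups)


for offset, rows in refs_by_container(EXAMPLE).items():
    ref_ids = sorted({ref_id for ref_id, _start, _end in rows})
    print(offset, "refs:", ref_ids)
```

Containers listing more than one ref id are the multi-ref ones; every
span in this example is positive, unlike the INT_MIN row in the report.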

-- 
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
