On Thu, Feb 27, 2020 at 04:36:39AM +0000, Richard Wilton wrote:
> That's what the SAM spec says in its description of the @SQ header record.  
> But given the wide variety of sort orders that can be represented in SAM 
> format, this can't be right.

For SAM, every record contains a copy of the SQ SN string.  However, the
sort is indeed by the order of the SQ entries rather than by the actual
RNAME string itself.

BAM does not store the SN string verbatim.  Instead it has a binary
refID ("tid" in samtools world) which is an integer referencing the
Nth SQ line.  It's a bit more logical as sorting by this refID integer
is the same as sorting by SQ line order, so there is no confusion.

Hence the sort order remains the same for both SAM and BAM.
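
As a rough illustration of the above (a minimal sketch, not how samtools
implements it), the coordinate sort key can be thought of as the pair
(refID, POS), where refID is the 0-based index of the matching SQ line.
The reference names, positions and the helper name coordinate_key below
are made up for the example.

```python
# Hypothetical @SQ order as it might appear in a header (illustrative only).
sq_order = ["chr1", "chr2", "chr10", "chrM"]

# refID ("tid"): 0-based index of the @SQ line whose SN matches RNAME.
ref_id = {name: i for i, name in enumerate(sq_order)}

# Hypothetical (RNAME, POS) pairs standing in for alignment records.
records = [("chr10", 500), ("chr2", 100), ("chr1", 900), ("chr2", 50)]

def coordinate_key(rec):
    """Sort key: position of RNAME in the @SQ list, then POS."""
    rname, pos = rec
    return (ref_id[rname], pos)

# Sorted by @SQ line order, which is what a refID-based sort gives.
sorted_records = sorted(records, key=coordinate_key)

# Sorted by the RNAME string itself, for comparison.
string_sorted = sorted(records)

print(sorted_records)
print(string_sorted)
```

With this header the two orders differ: the SQ-based sort puts chr2 before
chr10, while the plain string sort puts chr10 before chr2.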

> In fact, the set of @SQ records functions as a lookup table, keyed by SN and 
> referenced by RNAME in each alignment record.  Is that what the spec is 
> trying to say?

I see the distinction - sorting by the key itself rather than sorting
by the ordering of SQ lines.  However I think the spec is saying
exactly what you think it's trying to say, so if you find something
ambiguous please quote the exact bit that is confusing.  It currently
states:

    "For coordinate sort, the major sort key is the {\sf RNAME} field, with
    order defined by the order of {\tt @SQ} lines in the header."

This is correct.  The key which is sorted on is RNAME, and the
collation order is the position N of the matching @SQ line rather than
the string itself.

I'm not sure what you mean by "wide variety of sort orders".  It's
true users can define their own sort orderings and we have a sub-sort
field too, but in those cases the above sentence doesn't apply (note
it is prefaced with "For coordinate sort").

Thanks,

James

PS.  Note that while the spec says one thing, it is not necessarily what
happens in the wild due to accidents and misunderstandings.

I've seen BAMs that have been split apart by chromosome, with BAM
filenames named after RNAME, then some form of parallel processing
takes place, and finally a "samtools cat in.*.bam -o out.bam" is run,
which ends up placing the records in lexicographic RNAME order rather
than Nth SQ line order, despite the @HD header claiming to be
coordinate sorted.  Obviously this is incorrect, although many tools
will still work.
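
To make the failure mode concrete, here is a small sketch under stated
assumptions: the per-chromosome filenames follow a hypothetical
"in.<RNAME>.bam" pattern, Python's sorted() stands in for the shell's
glob ordering (which may differ by locale), and is_coordinate_sorted is
an illustrative helper, not part of any samtools or htslib API.

```python
# Hypothetical @SQ order from the original header (illustrative only).
sq_order = ["chr1", "chr2", "chr10"]
ref_id = {name: i for i, name in enumerate(sq_order)}

def is_coordinate_sorted(records, ref_id):
    """True if (RNAME, POS) pairs are ordered by @SQ index, then POS."""
    keys = [(ref_id[rname], pos) for rname, pos in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))

# Per-chromosome files, in the order a lexicographic glob would list them.
files = sorted(f"in.{name}.bam" for name in sq_order)

# Pretend each file holds one record at POS 100; concatenate in file order.
concatenated = [(fname.split(".")[1], 100) for fname in files]

# The same records concatenated in header (@SQ) order instead.
header_order = [(name, 100) for name in sq_order]

print(files)
print(is_coordinate_sorted(concatenated, ref_id))
print(is_coordinate_sorted(header_order, ref_id))
```

Here the glob-style ordering places chr10 before chr2, so the concatenated
stream fails the check even though each input file was sorted on its own.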

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA




_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
