On Thu, Feb 27, 2020 at 04:36:39AM +0000, Richard Wilton wrote:
> That's what the SAM spec says in its description of the @SQ header record.
> But given the wide variety of sort orders that can be represented in SAM
> format, this can't be right.

For SAM, every record contains a copy of the SQ SN string. However, coordinate sort does indeed order records by the position of their SQ entries rather than by the actual RNAME string itself.

BAM does not store the SN string verbatim. Instead it has a binary refID ("tid" in samtools world), which is an integer referencing the Nth SQ line. It's a bit more logical, as sorting by this refID integer is the same as sorting by SQ line order, so there is no confusion. Hence the sort order remains the same for both SAM and BAM.

> In fact, the set of @SQ records functions as a lookup table, keyed by SN and
> referenced by RNAME in each alignment record. Is that what the spec is
> trying to say?

I see the distinction - sorting by the key itself rather than sorting by the ordering of SQ lines. However, I think the spec is saying exactly what you think it's trying to say, so if you find something ambiguous please quote the exact bit that is confusing.

It currently states:

"For coordinate sort, the major sort key is the {\sf RNAME} field, with order defined by the order of {\tt @SQ} lines in the header."

This is correct. The key which is sorted on is RNAME, and the collation order is given by the position (N) of the matching SQ line rather than by the string itself.

I'm not sure what you mean by "wide variety of sort orders". It's true users can define their own sort orderings, and we have a sub-sort field too, but in those cases the above sentence doesn't apply (note it is prefaced with "For coordinate sort").

Thanks,

James

PS. Note that while the spec says one thing, it is not necessarily what happens in the wild, due to accidents and misunderstandings. I've seen BAMs that have been split apart by chromosome, with BAM filenames named after RNAME; then some form of parallel processing takes place, and finally a "samtools cat in.*.bam -o out.bam" is run, which ends up placing the records in lexicographic RNAME order rather than Nth SQ line order, despite the @HD header claiming to be coordinate sorted. Obviously this is incorrect, although many tools will still work.
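
To make the two orderings concrete, here is a minimal Python sketch. It is not samtools code: the header contents, reference names, records, and variable names are all invented for illustration. It sorts the same records once by @SQ line index (the refID-style order the spec describes for coordinate sort) and once by the RNAME string, which is roughly what the filename-glob accident above produces.

```python
# Hypothetical data, for illustration only (not read from a real file).
# SN values in the order their @SQ lines appear in the header.
sq_order = ["chr1", "chr2", "chr10"]

# Lookup table keyed by SN; the value is the @SQ line index,
# playing the role of BAM's integer refID ("tid").
ref_id = {name: i for i, name in enumerate(sq_order)}

# (RNAME, POS) pairs standing in for alignment records.
records = [("chr10", 500), ("chr2", 100), ("chr1", 900), ("chr2", 50)]

# Coordinate sort as the spec describes it: the major key is RNAME,
# collated by @SQ line index; the minor key is POS.
by_sq = sorted(records, key=lambda r: (ref_id[r[0]], r[1]))

# Sorting on the RNAME string itself gives lexicographic order,
# which places "chr10" before "chr2".
by_name = sorted(records, key=lambda r: (r[0], r[1]))

print("by @SQ order:", by_sq)
print("by RNAME    :", by_name)
```

With this made-up header the two results differ only in where chr10 lands, which is why output ordered by RNAME string can go unnoticed while the @HD line still claims coordinate sort.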

--
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA