Hello Will,

On Mon, Nov 05, 2018 at 02:09:09PM -0500, Will Stokes wrote:
> Thank you so much for those incredibly helpful pointers. I'm happy to
> report I have been able ot use Htslib to write out an alignment in BAM and
> SAM format that I created denovo. That said, I have a few remaining
> questions. I want to make sure I'm doing everything correctly.
> 
> -When including a read that aligns in the reverse orientation, I'm using
> 0x10 (decimal 16)
> in the flag. When writing the sequence to bam1_t.data, should I be encoding
> the reverse complement?
> Similarly, should the CIGAR refer to how the reverse complement compares to
> the
> top strand in the reference sequence? I'm currently doing both.

Yes to both.  BAM (and SAM) store everything about the sequence in the
mapped orientation.

> -Is it required to initialize the following bam_hdr_t attributes?
> 
> int32_t ignore_sam_err
> int8_t *cigar_tab
> void *sdict
> 
> At the moment I'm using calloc to allocate the bam_hdr_t, so all are 0.
> This appears to be working fine. I'm honestly not
> sure what these fields encode or how to use them.

The htslib bam_hdr_init() function initialises a bam_hdr_t struct
(although it's nothing more than a calloc in implementation).

cigar_tab is some bizarre caching of data in sam_parse1.  I've no idea
why it's in the header structure rather than simply being defined as a
static array in the code - it never changes.

sdict is only initialised when needed, to perform @SQ line lookups.
Leaving it null is appropriate.

I didn't even realise ignore_sam_err existed.  It appears to be
completely unused except in the test harness.  It's probably an idea
which never went anywhere, but perhaps it was added for some other
tool that uses htslib?  (If so there's no indication in git logs.)

> -What is bam1_core_t.isize for? It isn't documented in sam.h. I have it set
> to 0 at the moment.

Leaving it as zero is fine.  It is the TLEN field in the SAM format
and 0 is the appropriate "unknown" value.

> -What should I set bam1_t.id to? Right now I have it set to 0.

That came in during #1b554a48 (2012) with no explanation.  It's set by
the iterator code, but is apparently unused after that.  I've no idea
what the intended purpose is, but leaving it as zero is correct as
it's what the rest of the code does (with the exception of
iterators).

> -I'm not writing any "extra bytes" after the read quality, aka "extra_len".
> What is this region used for?

Do you mean extranul?  This is used to pad out read names to end on a
multiple of 4 bytes in the in-memory address, so that the 32-bit cigar
fields are correctly aligned in memory.  (Failure to do this can cause
some SIMD compiler optimisations to cause crashes.)  We could consider
this to be a BAM design flaw as swapping the read name and cigar field
would make this hack unnecessary.

It's purely an in-memory trick though and when written to disk the
layout is as per the specification.  Provided your in-memory layout
and extranul usage are internally consistent you should be fine for
writing.

James
-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to