Hello Will, On Mon, Nov 05, 2018 at 02:09:09PM -0500, Will Stokes wrote: > Thank you so much for those incredibly helpful pointers. I'm happy to > report I have been able ot use Htslib to write out an alignment in BAM and > SAM format that I created denovo. That said, I have a few remaining > questions. I want to make sure I'm doing everything correctly. > > -When including a read that aligns in the reverse orientation, I'm using > 0x10 (decimal 16) > in the flag. When writing the sequence to bam1_t.data, should I be encoding > the reverse complement? > Similarly, should the CIGAR refer to how the reverse complement compares to > the > top strand in the reference sequence? I'm currently doing both.
Yes to both. BAM (and SAM) store everything about the sequence in the mapped orientation. > -Is it required to initialize the following bam_hdr_t attributes? > > int32_t ignore_sam_err > int8_t *cigar_tab > void *sdict > > At the moment I'm using calloc to allocate the bam_hdr_t, so all are 0. > This appears to be working fine. I'm honestly not > sure what these fields encode or how to use them. The htslib bam_hdr_init() function initialises a bam_hdr_t struct (although it's nothing more than a calloc in implementation). cigar_tab is some bizarre caching of data in sam_parse1. I've no idea why it's in the header structure rather than simply being defined as a static array in the code - it never changes. sdict is only initialised when needed, to perform @SQ line lookups. Leaving it null is appropriate. I didn't even realise ignore_sam_err existed. It appears to be completely unused except in the test harness. It's probably an idea which never went anywhere, but perhaps it was added for some other tool that uses htslib? (If so there's no indication in git logs.) > -What is bam1_core_t.isize for? It isn't documented in sam.h. I have it set > to 0 at the moment. Leaving it as zero is fine. It is the TLEN field in the SAM format and 0 is the appropriate "unknown" value. > -What should I set bam1_t.id to? Right now I have it set to 0. That came in during #1b554a48 (2012) with no explanation. It's set by the iterator code, but is apparently unused after that. I've no idea what the intended purpose is, but leaving it as zero is correct as it's what the rest of the code does (with the exception of iterators). > -I'm not writing any "extra bytes" after the read quality, aka "extra_len". > What is this region used for? Do you mean extranul? This is used to pad out read names to end on a multiple of 4 bytes in the in-memory address, so that the 32-bit cigar fields are correctly aligned in memory. (Failure to do this can cause some SIMD compiler optimisations to cause crashes.) We could consider this to be a BAM design flaw as swapping the read name and cigar field would make this hack unnecessary. It's purely an in-memory trick though and when written to disk the layout is as per the specification. Provided your in-memory layout and extranul usage are internally consistent you should be fine for writing. James -- James Bonfield (j...@sanger.ac.uk) The Sanger Institute, Hinxton, Cambs, CB10 1SA -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help