James, Thank you so much for those incredibly helpful pointers. I'm happy to report I have been able ot use Htslib to write out an alignment in BAM and SAM format that I created denovo. That said, I have a few remaining questions. I want to make sure I'm doing everything correctly.
-When including a read that aligns in the reverse orientation, I'm using 0x10 (decimal 16) in the flag. When writing the sequence to bam1_t.data, should I be encoding the reverse complement? Similarly, should the CIGAR refer to how the reverse complement compares to the top strand in the reference sequence? I'm currently doing both. -Is it required to initialize the following bam_hdr_t attributes? int32_t ignore_sam_err int8_t *cigar_tab void *sdict At the moment I'm using calloc to allocate the bam_hdr_t, so all are 0. This appears to be working fine. I'm honestly not sure what these fields encode or how to use them. -What is bam1_core_t.isize for? It isn't documented in sam.h. I have it set to 0 at the moment. -What should I set bam1_t.id to? Right now I have it set to 0. -I'm not writing any "extra bytes" after the read quality, aka "extra_len". What is this region used for? Many thanks in advance for your help. -Will On Tue, Oct 30, 2018 at 5:39 AM James Bonfield <j...@sanger.ac.uk> wrote: > Hello Will, > > On Mon, Oct 29, 2018 at 04:25:38PM -0400, Will Stokes wrote: > > Hello, I have been experimenting with using htslib to read SAM / BAM > > content which has gone well. I'd now like to be able to write SAM / BAM > > content de Novo. I see two options: > > > > 1) write SAM content myself, then use samtools or htslib to convert from > > SAM to BAM format > > That would work, but is likely slow. Writing SAM efficiently isn't > easy as the general C/C++ functions tend to be rather inefficient at > formatting numbers. > > > 2) write BAM content directly using htslib. > > > > For example, creating the header seems relatively straightforward: > > > > header = bam_hdr_init(); > > > > header->n_targets = count; > > > > header->target_len = (uint32_t*)calloc( count, sizeof(uint32_t) ); > > > > for(int i=0; i < count; i++) > > > > { > > > > header->target_len[i] = nameLens[i].length(); > > > > header->target_name[i] = strdup( names[i] ); > > > > } > > > > > > I don't think I need to populate header->text and header->l_text, > > correct me if I'm wrong. > > I'm not 100% sure about headers. I think it's permissible to simply > have a BAM file containing the arrays of seq name/length, but this > doesn't leave any room for useful meta data such as read-groups. It's > also mandatory to have @SQ M5 headers if you're writing CRAM. > > Basically I think it'll work without a header, but it's strongly > recommended to create one. > > > > Writing looks straight forward as well: > > > > > > samFile* file = sam_open( tempFile, "w" ); > > > > sam_hdr_write( file, header ); > > > > > > I'm assuming the filename extension is used by sam_open to determine if > it > > should write out using SAM or BAM encoding. Please confirm. > > The mode string handles this, so "w" for sam, "wb" for bam and "wc" > for cram. > > There's also sam_open_mode() which takes format as a string (sam, bam, > cram), or look for the suffix if no format string was given. It then > calls sam_open with the appropriate mode string. > > > However, in order to write alignments... > > > > for( auto alignment : alignments ) > > > > { sam_write1( file, header, alignment ); } > > > > > > I'll first need a list of bam1_t objects. I'm seeing guidance for > constucting > > > > bam1_t and the related bam1_core_t structures. I have been poking > > around in the test folders for both samtools and htslib and must have > > overlooked such an example. > > > > Are there any examples of doing this? > > This interface is sadly lacking! We really need a public facing way > of doing this. > > The in-memory bam1_t / bam1_core_t structures are laid out in the > uncompressed BAM format as defined in the SAM spec (barring the odd > bit of memory alignment). It's tedious to do by hand, but not that > hard. > > There are htslib internal functions for this though. The main > sam_parse1 function constructs a bam1_t structure, but it's optimised > for speed, not readability, so it is quite hard to follow. > > The CRAM implementation added a "bam_construct_seq" function which > takes a series of strings and integers to generate a bam1_t struct. I > think this is an internal function, but you can look at the code at > least as a demonstration for how to do it. See > htslib/cram/cram_samtools.c. > > A note of confusion-avoidance, bam_seq_t is the same as bam1_t. It's > a typedef simply due to the ancestry of this code - the CRAM code > started off as part of Staden io_lib, which had a different type > name. I really should go through and rename them all, but it's not > externally visible so has never been high priority. > > For example usage of this, check sam_read1() in sam.c and follow the > chain of calling via cram_get_bam_seq -> cram_to_bam -> > bam_construct_seq. > > James > > -- > James Bonfield (j...@sanger.ac.uk) > The Sanger Institute, Hinxton, Cambs, CB10 1SA > > > -- > The Wellcome Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. >
_______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help