Hello Will, On Mon, Oct 29, 2018 at 04:25:38PM -0400, Will Stokes wrote: > Hello, I have been experimenting with using htslib to read SAM / BAM > content which has gone well. I'd now like to be able to write SAM / BAM > content de Novo. I see two options: > > 1) write SAM content myself, then use samtools or htslib to convert from > SAM to BAM format
That would work, but is likely slow. Writing SAM efficiently isn't easy as the general C/C++ functions tend to be rather inefficient at formatting numbers. > 2) write BAM content directly using htslib. > > For example, creating the header seems relatively straightforward: > > header = bam_hdr_init(); > > header->n_targets = count; > > header->target_len = (uint32_t*)calloc( count, sizeof(uint32_t) ); > > for(int i=0; i < count; i++) > > { > > header->target_len[i] = nameLens[i].length(); > > header->target_name[i] = strdup( names[i] ); > > } > > > I don't think I need to populate header->text and header->l_text, > correct me if I'm wrong. I'm not 100% sure about headers. I think it's permissible to simply have a BAM file containing the arrays of seq name/length, but this doesn't leave any room for useful meta data such as read-groups. It's also mandatory to have @SQ M5 headers if you're writing CRAM. Basically I think it'll work without a header, but it's strongly recommended to create one. > Writing looks straight forward as well: > > > samFile* file = sam_open( tempFile, "w" ); > > sam_hdr_write( file, header ); > > > I'm assuming the filename extension is used by sam_open to determine if it > should write out using SAM or BAM encoding. Please confirm. The mode string handles this, so "w" for sam, "wb" for bam and "wc" for cram. There's also sam_open_mode() which takes format as a string (sam, bam, cram), or look for the suffix if no format string was given. It then calls sam_open with the appropriate mode string. > However, in order to write alignments... > > for( auto alignment : alignments ) > > { sam_write1( file, header, alignment ); } > > > I'll first need a list of bam1_t objects. I'm seeing guidance for constucting > > bam1_t and the related bam1_core_t structures. I have been poking > around in the test folders for both samtools and htslib and must have > overlooked such an example. > > Are there any examples of doing this? This interface is sadly lacking! We really need a public facing way of doing this. The in-memory bam1_t / bam1_core_t structures are laid out in the uncompressed BAM format as defined in the SAM spec (barring the odd bit of memory alignment). It's tedious to do by hand, but not that hard. There are htslib internal functions for this though. The main sam_parse1 function constructs a bam1_t structure, but it's optimised for speed, not readability, so it is quite hard to follow. The CRAM implementation added a "bam_construct_seq" function which takes a series of strings and integers to generate a bam1_t struct. I think this is an internal function, but you can look at the code at least as a demonstration for how to do it. See htslib/cram/cram_samtools.c. A note of confusion-avoidance, bam_seq_t is the same as bam1_t. It's a typedef simply due to the ancestry of this code - the CRAM code started off as part of Staden io_lib, which had a different type name. I really should go through and rename them all, but it's not externally visible so has never been high priority. For example usage of this, check sam_read1() in sam.c and follow the chain of calling via cram_get_bam_seq -> cram_to_bam -> bam_construct_seq. James -- James Bonfield (j...@sanger.ac.uk) The Sanger Institute, Hinxton, Cambs, CB10 1SA -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help