James,

Thank you so much for those incredibly helpful pointers. I'm happy to
report I have been able ot use Htslib to write out an alignment in BAM and
SAM format that I created denovo. That said, I have a few remaining
questions. I want to make sure I'm doing everything correctly.

-When including a read that aligns in the reverse orientation, I'm using
0x10 (decimal 16)
in the flag. When writing the sequence to bam1_t.data, should I be encoding
the reverse complement?
Similarly, should the CIGAR refer to how the reverse complement compares to
the
top strand in the reference sequence? I'm currently doing both.

-Is it required to initialize the following bam_hdr_t attributes?

int32_t ignore_sam_err
int8_t *cigar_tab
void *sdict

At the moment I'm using calloc to allocate the bam_hdr_t, so all are 0.
This appears to be working fine. I'm honestly not
sure what these fields encode or how to use them.

-What is bam1_core_t.isize for? It isn't documented in sam.h. I have it set
to 0 at the moment.

-What should I set bam1_t.id to? Right now I have it set to 0.

-I'm not writing any "extra bytes" after the read quality, aka "extra_len".
What is this region used for?

Many thanks in advance for your help.

-Will


On Tue, Oct 30, 2018 at 5:39 AM James Bonfield <j...@sanger.ac.uk> wrote:

> Hello Will,
>
> On Mon, Oct 29, 2018 at 04:25:38PM -0400, Will Stokes wrote:
> > Hello, I have been experimenting with using htslib to read SAM / BAM
> > content which has gone well. I'd now like to be able to write SAM / BAM
> > content de Novo. I see two options:
> >
> > 1) write SAM content myself, then use samtools or htslib to convert from
> > SAM to BAM format
>
> That would work, but is likely slow.  Writing SAM efficiently isn't
> easy as the general C/C++ functions tend to be rather inefficient at
> formatting numbers.
>
> > 2) write BAM content directly using htslib.
> >
> > For example, creating the header seems relatively straightforward:
> >
> >   header = bam_hdr_init();
> >
> >   header->n_targets = count;
> >
> >   header->target_len = (uint32_t*)calloc( count, sizeof(uint32_t) );
> >
> >   for(int i=0; i < count; i++)
> >
> > {
> >
> >     header->target_len[i]  = nameLens[i].length();
> >
> >     header->target_name[i] = strdup( names[i] );
> >
> > }
> >
> >
> > I don't think I need to populate header->text and header->l_text,
> > correct me if I'm wrong.
>
> I'm not 100% sure about headers.  I think it's permissible to simply
> have a BAM file containing the arrays of seq name/length, but this
> doesn't leave any room for useful meta data such as read-groups.  It's
> also mandatory to have @SQ M5 headers if you're writing CRAM.
>
> Basically I think it'll work without a header, but it's strongly
> recommended to create one.
>
>
> > Writing looks straight forward as well:
> >
> >
> > samFile* file = sam_open( tempFile, "w" );
> >
> > sam_hdr_write( file, header );
> >
> >
> > I'm assuming the filename extension is used by sam_open to determine if
> it
> > should write out using SAM or BAM encoding. Please confirm.
>
> The mode string handles this, so "w" for sam, "wb" for bam and "wc"
> for cram.
>
> There's also sam_open_mode() which takes format as a string (sam, bam,
> cram), or look for the suffix if no format string was given.  It then
> calls sam_open with the appropriate mode string.
>
> > However, in order to write alignments...
> >
> > for( auto alignment : alignments )
> >
> > {  sam_write1( file, header, alignment ); }
> >
> >
> > I'll first need a list of bam1_t objects. I'm seeing guidance for
> constucting
> >
> > bam1_t and the related bam1_core_t structures. I have been poking
> > around in the test folders for both samtools and htslib and must have
> > overlooked such an example.
> >
> > Are there any examples of doing this?
>
> This interface is sadly lacking!  We really need a public facing way
> of doing this.
>
> The in-memory bam1_t / bam1_core_t structures are laid out in the
> uncompressed BAM format as defined in the SAM spec (barring the odd
> bit of memory alignment).  It's tedious to do by hand, but not that
> hard.
>
> There are htslib internal functions for this though.  The main
> sam_parse1 function constructs a bam1_t structure, but it's optimised
> for speed, not readability, so it is quite hard to follow.
>
> The CRAM implementation added a "bam_construct_seq" function which
> takes a series of strings and integers to generate a bam1_t struct.  I
> think this is an internal function, but you can look at the code at
> least as a demonstration for how to do it.  See
> htslib/cram/cram_samtools.c.
>
> A note of confusion-avoidance, bam_seq_t is the same as bam1_t.  It's
> a typedef simply due to the ancestry of this code - the CRAM code
> started off as part of Staden io_lib, which had a different type
> name.  I really should go through and rename them all, but it's not
> externally visible so has never been high priority.
>
> For example usage of this, check sam_read1() in sam.c and follow the
> chain of calling via cram_get_bam_seq -> cram_to_bam ->
> bam_construct_seq.
>
> James
>
> --
> James Bonfield (j...@sanger.ac.uk)
> The Sanger Institute, Hinxton, Cambs, CB10 1SA
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to