On Fri, Oct 19, 2018 at 04:30:33PM -0600, Brent Pedersen wrote:
> I want to store a minimal representation of an alignment (more minimal
> than CRAM).
> I thought I could save the position, cigar, and MD and be able to
> reconstruct the read sequence, but the MD (IIUC) allows reconstructing
> the reference from the read.

To reconstruct the sequence you'll need an analogue of MD, as you say.
This isn't something available at the moment.

> Is there something exposed in the current htslib that would facilitate
> this? E.g. a tag that's sort of the inverse of MD? I know this must be
> used internally for CRAM.

Internally CRAM has a number of data series to encode differences
between a sequence and a reference.  These differences are called
"features" and loosely correspond to CIGAR operations.

    FN (number of features)
    FP (position of feature; a delta to the last feature)
    FC (feature code; type of the edit)

Then each feature has its own associated data series.  Eg FC "X" is
substitution, which stores the edit in the BS (base substitution) data
series, FC "D" (deletion) stores the deletion length (but not the
bases) in DL, and FC "I" (insertion) stores the inserted bases in IN.

This is a rather distributed way of doing things, rather than a single
aux tag like MD.  Hence we don't have any code that can easily be used
for this.

James

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA

-- 
The Wellcome Sanger Institute is operated by Genome Research Limited,
a charity registered in England with number 1021457 and a company
registered in England with number 2742969, whose registered office is
215 Euston Road, London, NW1 2BE.
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help
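
For illustration only, here is a rough Python sketch of how a read could
be rebuilt from a reference plus a feature list in the spirit of the
FP/FC scheme described in the reply.  It is not htslib code and not the
CRAM decoder.  The function name reconstruct_read, the tuple layout
(delta, code, payload), the 0-based read coordinates, and the explicit
read_len argument are all assumptions made for this example.  It also
stores the substituted base directly for "X", which is a simplification
of what CRAM keeps in the BS data series.

```python
def reconstruct_read(ref, start, read_len, features):
    """Rebuild a read sequence from a reference and a list of edits.

    ref       -- reference sequence as a string
    start     -- 0-based reference position where the alignment starts
    read_len  -- length of the read to rebuild
    features  -- list of (delta, code, payload) tuples (hypothetical layout):
                 delta   : offset in read coordinates from the previous
                           feature (from the read start for the first one)
                 code    : "X" substitution, "I" insertion, "D" deletion
                 payload : base for "X", inserted bases for "I",
                           deletion length for "D"
    """
    pieces = []      # chunks of the read, joined at the end
    out_len = 0      # number of read bases produced so far
    ref_pos = start  # next reference base to copy
    feat_pos = 0     # read position of the current feature

    for delta, code, payload in features:
        feat_pos += delta
        gap = feat_pos - out_len
        if gap < 0:
            raise ValueError("feature position lies before bases already written")
        # Bases between features match the reference, so copy them.
        pieces.append(ref[ref_pos:ref_pos + gap])
        out_len += gap
        ref_pos += gap

        if code == "X":      # substitution: one read base, one reference base
            pieces.append(payload)
            out_len += 1
            ref_pos += 1
        elif code == "I":    # insertion: read bases only
            pieces.append(payload)
            out_len += len(payload)
        elif code == "D":    # deletion: skip reference bases only
            ref_pos += payload
        else:
            raise ValueError("unsupported feature code: %r" % (code,))

    # Whatever is left of the read matches the reference.
    tail = read_len - out_len
    pieces.append(ref[ref_pos:ref_pos + tail])
    return "".join(pieces)


ref = "ACGTACGTACGT"
features = [(1, "X", "A"), (2, "I", "TT"), (3, "D", 2)]
print(reconstruct_read(ref, 2, 8, features))  # GAATTCAC
```

Note that only the edits and the deletion length are stored, never the
deleted or matching bases, which is why the reference is needed to
rebuild the read.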