Thanks for the reply, the old mailing list model dies hard :) Samtools view seems to decode this correctly (with or without the reference sequence that they sent to me, a MT chromosome, specified with -T) with that exact CIGAR string, and additionally calculates an MD string...
All letters in the reference they sent me are upper cased in the file I received, and from their commands they sent it looks like they were consistent about using the same reference -Colin On Thu, May 30, 2019 at 5:08 AM James Bonfield <j...@sanger.ac.uk> wrote: > Hello Colin, > > Apologies for the slow reply. I don't normally check samtools > mailing list unless my email client tells me something new arrived, > but for some reason that's broken. Anyway... > > On Wed, May 22, 2019 at 01:25:59PM -0400, Colin wrote: > > Before: > > > > [ { code: 'X', data: 0, pos: 21, refPos: 198 }, > > { code: 'X', data: 2, pos: 52, refPos: 229 }, > > { code: 'X', data: 0, pos: 54, refPos: 231 }, > > { code: 'X', data: 2, pos: 70, refPos: 247 }, > > { code: 'X', data: 2, pos: 80, refPos: 257 }, > > { code: 'X', data: 1, pos: 86, refPos: 263 }, > > { code: 'I', data: 'CT', pos: 133, refPos: 310 }, > > { code: 'X', data: 1, pos: 135, refPos: 310 } ] > > > > After: > > > > [ { code: 'b', > > > > data: > '65,84,84,65,67,65,71,71,67,71,65,65,67,65,84,65,67,84,84,65,65,84,65,65,65,71,84,71,84,71,84,84,65,65,84,84,65,65,84,84,65,65,84,71,67,84,84,71,84,65,71,84,65,65,65,84,65,65,84,65,65,84,65,65,67,65,65,84,84,84,65,65,84,71,84,67,84,71,67,84,67,65,71,67,67,71,67,84,84,84,67,67,65,67,65,67,65,71,65,67,65,84,67,65,84,65,65,67,65,65,65,65,65,65,84,84,84,67,67,65,67,67,65,65,65,67,67,67,67,67,67,67', > > pos: 1, > > refPos: 178 }, > > { code: 'I', data: 'CT', pos: 133, refPos: 310 }, > > { code: 'Q', data: 36, pos: 133, refPos: 308 }, > > { code: 'Q', data: 36, pos: 134, refPos: 309 }, > > { code: 'b', > > data: '67,67,67,67,67,67,71,67,84,84,67,84,71,71,67', > > pos: 135, > > refPos: 310 } ] > > That's very mysterious. It appears it's replacing everything that > isn't the insertion with "b". Maybe it's become misaligned due to > some process and rather than create a whole string of SNPs it's > created "b" features instead? I'm guessing this must be htsjdk as my > code doesn't take that approach, even though it may well be a better > strategy. > > Does the sequence in question decode correctly? Does it have a > sensible looking CIGAR string? Eg 132M2I15M? > > > The header of the file that contained the 'b' features used > MarkDuplicates > > I don't see why MarkDuplicates would be changing alignments in any > way. It may however be decoding the record and re-encoding it, which > could change the decisions as to how to encode something (especially > if the original file was, say, written by samtools and the subsequent > one is by picard). > > Is the reference in question here uppercase or lowercase? If I recall > there were issues with that at one point. > > James > > -- > James Bonfield (j...@sanger.ac.uk) > The Sanger Institute, Hinxton, Cambs, CB10 1SA > > > -- > The Wellcome Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. >
_______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help