Sorry this is long, so I'll get to the question upfront. Do people actually CARE about the various breakages in the string pileup format? If so read on, if not then fine.
Consider an alignment like this: Pos 12345 6789 Ref AAAAA TTTAA Seq1 AAAAACCTTTAA (cigar 5M 2I 5M) Seq2 AAAAA ***AA (cigar 5M 3D 2M) Seq3 AAAAACC***AA (cigar 5M 2I 3D 2M) At pos 5 pileup will format for sq1 a string of A+2CC indicating the 2 bp insertion. Seq2 will show A-3TTT indicating the 3bp deletion. Positions 6, 7 and 8 will have "*" present as the base for seq2. All good so far, but what about seq3? It currently shows A+2CC. Shouldn't it be A+2CC-3TTT to indicate the deletion following the insertion? The next row in the pileup output is pos 6, which is already too late, and contrast seq2 to seq3 which both have 3 bases deleted at pos 6-8, yet only one of them admits what it is. The pileup format explains +N<base> and -N<base> notations and nowhere does it exclude having both present at the same point, so I believe this to be a bug. However, how to fix it? This is the hard part. pileup_seq() in bam_plcmd.c generates these +N<base> and -N<base> strings, along with other markers for ref-skip (cigar "N"). It does so by looking at the p->indel and p->isdel variables. p->indel is the size of the indel; positive for insertion and negative for deletion. This fundamentally makes it impossible to report insertion followed by deletion as there simply isn't enough information. Ideally after an insertion we'd check the next cigar op to see if it's a deletion so we can print that up too, as we know the next loop around bam_mplp_auto() will have moved onto the deletion site. So we could just check the cigar opcodes to find out what is next. It's possible to convert a position P in the sequence to the N-th location in the K-th cigar op, but it requires scanning the entire cigar string each time to do that conversion. The original pileup implementation did this, but it was horribly slow on long sequences with complex cigar strings (eg pacbio). It now caches this data in a cstate_t struct: typedef struct { int k, x, y, end; } cstate_t; This in turn is held in lbnode_t: typedef struct __linkbuf_t { bam1_t b; int32_t beg, end; cstate_t s; struct __linkbuf_t *next; } lbnode_t; Which ultimately is linked to in the pileup iterator: struct __bam_plp_t { mempool_t *mp; lbnode_t *head, *tail; int32_t tid, pos, max_tid, max_pos; int is_eof, max_plp, error, maxcnt; uint64_t id; bam_pileup1_t *plp; // for the "auto" interface only bam1_t *b; bam_plp_auto_f func; void *data; olap_hash_t *overlaps; }; "bam_pileup1_t *plp" here is the only exposed part of the interface and it is this which is given to us after each iterator call. Ideally bam_pileup1_t would contain cstate_t so that functions iterating over the pileup could access information on the local cigar operations. This would also fix this example: Ref AAAAA TTTAA Seq1 AAAAACGTTTAA (cigar 5M 2I 5M) Seq2 AAAAAC*TTTAA (cigar 5M 1I 1P 5M) Seq3 AAAAA*GTTTAA (cigar 5M 1P 1I 5M) We'd want a pileup string at pos 5 to be (spaced out for clarity): A+2CG A+2C* A+2*G Instead we get the less useful: A+2CG A+1C A+1G Again, it's impossible (or at least highly inefficient) to fix this and add support for the P operator as the pileup callback simply has no cigar context to go on. Changing bam_pileup1_t though would break ABI compatibility as it is commonly supplied in arrays. Indeed I have an htslib pull request (https://github.com/samtools/htslib/pull/157) which implements support for P operator by extending bam_pileup1_t, stalled due to compatibility. One solution would be to create bam_pileup2_t (I assume that's what the mysterious 1 suffix appended to everything was meant to deal with) that has extra data, along with a duplicate set of functions to manipulate it, but it's a bit messy. Another solution is to simply accept the slowness and recompute the cached cigar index every time we need it in samtools. This is only at an insertion, so it's not going to be every call. It would give us fixed output, but at the cost of speed. The final solution, employed so far, is just to say "bof!" and ignore the problems. By now I'm sure everyone who is parsing the old pileup format instead of using, say, gvcf will have dealt with all these breakages anyway? Indeed maybe you consider the current bizareness to be "the format" and therefore any fix is infact a bug. Ie go and fix something real! :-) Any views? James -- James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ------------------------------------------------------------------------------ Site24x7 APM Insight: Get Deep Visibility into Application Performance APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month Monitor end-to-end web transactions and take corrective actions now Troubleshoot faster and improve end-user experience. Signup Now! http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140 _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help