Sorry this is long, so I'll get to the question upfront.

Do people actually CARE about the various breakages in the string
pileup format?  If so read on, if not then fine.


Consider an alignment like this:

Pos  12345  6789
Ref  AAAAA  TTTAA
Seq1 AAAAACCTTTAA (cigar 5M 2I    5M)
Seq2 AAAAA  ***AA (cigar 5M    3D 2M)
Seq3 AAAAACC***AA (cigar 5M 2I 3D 2M)

At pos 5 pileup will format for sq1 a string of A+2CC indicating the 2
bp insertion.  Seq2 will show A-3TTT indicating the 3bp deletion.
Positions 6, 7 and 8 will have "*" present as the base for seq2.

All good so far, but what about seq3?  It currently shows A+2CC.
Shouldn't it be A+2CC-3TTT to indicate the deletion following the
insertion?  The next row in the pileup output is pos 6, which is
already too late, and contrast seq2 to seq3 which both have 3 bases
deleted at pos 6-8, yet only one of them admits what it is.

The pileup format explains +N<base> and -N<base> notations and nowhere
does it exclude having both present at the same point, so I believe
this to be a bug.

However, how to fix it?  This is the hard part.  pileup_seq() in
bam_plcmd.c generates these +N<base> and -N<base> strings, along with
other markers for ref-skip (cigar "N").  It does so by looking at the
p->indel and p->isdel variables.  p->indel is the size of the indel;
positive for insertion and negative for deletion.  This fundamentally
makes it impossible to report insertion followed by deletion as there
simply isn't enough information.

Ideally after an insertion we'd check the next cigar op to see if it's
a deletion so we can print that up too, as we know the next loop
around bam_mplp_auto() will have moved onto the deletion site.  So we
could just check the cigar opcodes to find out what is next.  It's
possible to convert a position P in the sequence to the N-th location
in the K-th cigar op, but it requires scanning the entire cigar string
each time to do that conversion.  The original pileup implementation
did this, but it was horribly slow on long sequences with complex
cigar strings (eg pacbio).  It now caches this data in a cstate_t
struct:

typedef struct {
    int k, x, y, end;
} cstate_t;

This in turn is held in lbnode_t:

typedef struct __linkbuf_t {
    bam1_t b;
    int32_t beg, end;
    cstate_t s;
    struct __linkbuf_t *next;
} lbnode_t;

Which ultimately is linked to in the pileup iterator:

struct __bam_plp_t {
    mempool_t *mp;
    lbnode_t *head, *tail;
    int32_t tid, pos, max_tid, max_pos;
    int is_eof, max_plp, error, maxcnt;
    uint64_t id;
    bam_pileup1_t *plp;
    // for the "auto" interface only
    bam1_t *b;
    bam_plp_auto_f func;
    void *data;
    olap_hash_t *overlaps;
};

"bam_pileup1_t *plp" here is the only exposed part of the interface
and it is this which is given to us after each iterator call.  Ideally
bam_pileup1_t would contain cstate_t so that functions iterating over
the pileup could access information on the local cigar operations.
This would also fix this example:

Ref  AAAAA  TTTAA
Seq1 AAAAACGTTTAA (cigar 5M 2I    5M)
Seq2 AAAAAC*TTTAA (cigar 5M 1I 1P 5M)
Seq3 AAAAA*GTTTAA (cigar 5M 1P 1I 5M)

We'd want a pileup string at pos 5 to be (spaced out for clarity):
A+2CG A+2C* A+2*G

Instead we get the less useful:

A+2CG A+1C A+1G

Again, it's impossible (or at least highly inefficient) to fix this
and add support for the P operator as the pileup callback simply has
no cigar context to go on.

Changing bam_pileup1_t though would break ABI compatibility as it is
commonly supplied in arrays.  Indeed I have an htslib pull request
(https://github.com/samtools/htslib/pull/157) which implements support
for P operator by extending bam_pileup1_t, stalled due to
compatibility.

One solution would be to create bam_pileup2_t (I assume that's what
the mysterious 1 suffix appended to everything was meant to deal with)
that has extra data, along with a duplicate set of functions to
manipulate it, but it's a bit messy.

Another solution is to simply accept the slowness and recompute the
cached cigar index every time we need it in samtools.  This is only
at an insertion, so it's not going to be every call.  It would give us
fixed output, but at the cost of speed.

The final solution, employed so far, is just to say "bof!" and ignore
the problems.  By now I'm sure everyone who is parsing the old pileup
format instead of using, say, gvcf will have dealt with all these
breakages anyway?  Indeed maybe you consider the current bizareness to
be "the format" and therefore any fix is infact a bug.  Ie go and fix
something real! :-)

Any views?

James

-- 
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to