Hello,

One of the things on my to-do list (aka, the github issue has been
assigned to me: https://github.com/samtools/samtools/issues/496) is to
make samtools mpileup support the -a / -a -a options of samtools
depth; to report zero coverage regions.

It sounds trivial, and indeed isn't that hard, but I hit a stumbling
block.  What exactly *is* the mpileup format?  It's documentation is,
well, almost entirely absent.  It has a small paragraph explaining a
few aspects but not remotely enough to parse it properly.  Example:

[From within a git checkout, using its own test data]
@ seq3a[samtools.../samtools]; samtools view -b -s 0.3 test/dat/mpileup.2.sam > 
shallow.bam
@ seq3a[samtools.../samtools]; samtools view -b -s 1.3 test/dat/mpileup.2.sam > 
shallow2.bam
@ seq3a[samtools.../samtools]; samtools index shallow.bam
@ seq3a[samtools.../samtools]; samtools index shallow2.bam

@ seq3a[samtools.../samtools]; ./samtools mpileup -r 17:90-99 -Q 0 shallow.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
17      90      N       1       A       D
17      91      N       1       C       ?
17      92      N       1       G       :
17      93      N       1       A       J
17      94      N       1       C       H
17      95      N       1       C       '
17      96      N       1       A       .
17      97      N       1       A       5
17      98      N       1       C       >
17      99      N       1       T       B
@ seq3a[samtools.../samtools]; ./samtools mpileup -r 17:90-99 -Q 20 shallow.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
17      90      N       1       A       D
17      91      N       1       C       ?
17      92      N       1       G       :
17      93      N       1       A       J
17      94      N       1       C       H
17      95      N       0               
17      96      N       0               
17      97      N       1       A       5
17      98      N       1       C       >
17      99      N       1       T       B

Bases filtered by base quality can cause the depth to drop to zero.
These locations are still reported, but with empty strings.

If we give it multiple files then mpileup will report seq & qual
columns for each file.  The coverage of these may drop to zero in some
cases.  When that happens it uses * * as seq and qual:

@ seq3a[samtools.../samtools]; ./samtools mpileup -r 17:100-109 shallow.bam 
shallow2.bam|less
[mpileup] 1 samples in 2 input files
<mpileup> Set max per-file depth to 8000
17      100     N       1       C       H       0       *       *
17      101     N       1       C       5       0       *       *
17      102     N       1       C       A       0       *       *
17      103     N       1       T       ;       0       *       *
17      104     N       0                       0       *       *
17      105     N       1       G       0       0       *       *
17      106     N       0                       0       *       *
17      107     N       1       C       ;       0       *       *
17      108     N       1       C       <       0       *       *
17      109     N       1       T       I       0       *       *

In this case we see a combination of two different styles for
describing missing data.  They come from different causes (lack of
quality vs lack of coverage) so maybe it's correct, but none of this
is documented and I wonder what I should be doing for my pull request.

So...   my plan is to use the 0 * * notation, but I'm opening it up to
a wider audience here.  I  can clarify the documentation while I'm
at it, but I'm not really a heavy user of this data.  If you parse
this format then now is your chance to have your say.

James

PS. Don't make me regret this post. :-)  I'm tortured by pileup, to
the extent of completely reimplementing it myself once to avoid using
this code!

-- 
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to