Hello, One of the things on my to-do list (aka, the github issue has been assigned to me: https://github.com/samtools/samtools/issues/496) is to make samtools mpileup support the -a / -a -a options of samtools depth; to report zero coverage regions.
It sounds trivial, and indeed isn't that hard, but I hit a stumbling block. What exactly *is* the mpileup format? It's documentation is, well, almost entirely absent. It has a small paragraph explaining a few aspects but not remotely enough to parse it properly. Example: [From within a git checkout, using its own test data] @ seq3a[samtools.../samtools]; samtools view -b -s 0.3 test/dat/mpileup.2.sam > shallow.bam @ seq3a[samtools.../samtools]; samtools view -b -s 1.3 test/dat/mpileup.2.sam > shallow2.bam @ seq3a[samtools.../samtools]; samtools index shallow.bam @ seq3a[samtools.../samtools]; samtools index shallow2.bam @ seq3a[samtools.../samtools]; ./samtools mpileup -r 17:90-99 -Q 0 shallow.bam [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 17 90 N 1 A D 17 91 N 1 C ? 17 92 N 1 G : 17 93 N 1 A J 17 94 N 1 C H 17 95 N 1 C ' 17 96 N 1 A . 17 97 N 1 A 5 17 98 N 1 C > 17 99 N 1 T B @ seq3a[samtools.../samtools]; ./samtools mpileup -r 17:90-99 -Q 20 shallow.bam [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 17 90 N 1 A D 17 91 N 1 C ? 17 92 N 1 G : 17 93 N 1 A J 17 94 N 1 C H 17 95 N 0 17 96 N 0 17 97 N 1 A 5 17 98 N 1 C > 17 99 N 1 T B Bases filtered by base quality can cause the depth to drop to zero. These locations are still reported, but with empty strings. If we give it multiple files then mpileup will report seq & qual columns for each file. The coverage of these may drop to zero in some cases. When that happens it uses * * as seq and qual: @ seq3a[samtools.../samtools]; ./samtools mpileup -r 17:100-109 shallow.bam shallow2.bam|less [mpileup] 1 samples in 2 input files <mpileup> Set max per-file depth to 8000 17 100 N 1 C H 0 * * 17 101 N 1 C 5 0 * * 17 102 N 1 C A 0 * * 17 103 N 1 T ; 0 * * 17 104 N 0 0 * * 17 105 N 1 G 0 0 * * 17 106 N 0 0 * * 17 107 N 1 C ; 0 * * 17 108 N 1 C < 0 * * 17 109 N 1 T I 0 * * In this case we see a combination of two different styles for describing missing data. They come from different causes (lack of quality vs lack of coverage) so maybe it's correct, but none of this is documented and I wonder what I should be doing for my pull request. So... my plan is to use the 0 * * notation, but I'm opening it up to a wider audience here. I can clarify the documentation while I'm at it, but I'm not really a heavy user of this data. If you parse this format then now is your chance to have your say. James PS. Don't make me regret this post. :-) I'm tortured by pileup, to the extent of completely reimplementing it myself once to avoid using this code! -- James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ------------------------------------------------------------------------------ Site24x7 APM Insight: Get Deep Visibility into Application Performance APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month Monitor end-to-end web transactions and take corrective actions now Troubleshoot faster and improve end-user experience. Signup Now! http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140 _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help