On Wed, Jul 11, 2018 at 03:24:13PM +0200, Dr. Frank Vorh?lter wrote:
> Dear James,
> 
> Thanks a lot for taking care. 
> The alignment that was analysed by samtools was generated by bwa and 
> subsequently processed using samtools. However, older versions of the 
> software were applied, bwa 0.7.5a-r405 and samtools 1.2 (using htslib 1.2.1). 
> The NGS reads had been trimmed before using cutadapt. Hence, part of the 
> reads were shorter than 150.  No other tools than bwa bwasw and samtools 
> fixmate & sort were involved in obtaining the BAM file with the alignment 
> that was analysed using samtools stats. For data analysis, a current version 
> of samtools was applied, samtools 1.8 (using htslib 1.8).

The insert size is very clearly halving before 150 (not doubly after I
think, when compared with the awk data).  I don't see this behaviour
with my tests, but I suspect something is causing it to filter data.

It's curious that there is a difference between a trivial awk script
vs samtools stats.  Obviously there may be a small number of badly
mapped pairs, supplementary and secondary reads, etc that could skew
the figures for the basic awk, but the key thing is there is no
discontinuity in the awk data.

Looking at the samtools stats code it has a check for PAIRED and
!UNMAPPED and !MATE_UNMAPPED.  It doesn't check for the PROPER_PAIR
flag.  (I didn't look at any of these in the awk noddy script.)

It then has some logic to halve the insert counts as the size will be
doubled due to counting each end twice (see comment "// Each pair was
counted twice" in stats.c).  I don't *think* this will be a problem,
but it does seem suspicious.

At this point I'm struggling to work out quite what causes this on
your data as I haven't seen it elsewhere.  Are you using any options
to samtools stats, eg -i for maximum insert size or -f / -F for
filtering?  If you supply -F 0xF00 (filtering SECONDARY, QCFAIL, DUP
and SUPPLEMENTARY) does the problem persist?  Does the awk equivalent
when given samtools view -F 0xF00 as input also show the problem?

I'm clutching at straws here to be honest!

James


-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to