Hello, On Mon, Jul 09, 2018 at 05:51:15PM +0200, Dr. Frank Vorh?lter wrote: > I had a detailed view on the insert size distribution of NGS read > pairs from a BAM file using SAMtools stats of samtools 1.8 with htslib > 1.8. The BAM file with the mapped reads represents a library where > transposon-based fragmentation resulted in fragments that partly were > shorter than the read lengths of the subsequent sequencing run on an > Illumina instrument. Surprisingly, as you may see from the included > diagram, the number of read pairs increased by almost exactly 100% > when the insert size exceeded 150, which was the number of cycles in > the sequencing run. This phenomenon was not obvious from experimental > measurements of the library fragment lengths.
Doubling in size at 150 sounds suspiciously like a read-length effect, possibly caused by something that thinks exact overlaps are duplicates? Are all of your reads 150bp? Was there anything other than fixmate and sort that was applied to this data? I tried a really noddy example which generates a random genome and picks 100bp pairs with an insert sizes with a flat distribution of 50 to 250bp. See below[1]. I then tried bwasw and mem alignments. Eg: gen.pl > test.fa bwa index ref.fa bwa mem ref.fa test.fa | samtools fixmate - - | tee test.bam | samtools stats - | egrep '^IS' > _ gnuplot ( plot "_" using 2:3 ) The observed plot matches the expected distribution, so I cannot explain the cause of your observed distribution. What if we just look at the isize figures quoted in the BAM file? Eg: samtools view test.bam|awk '{a[$9]++} END {for (i in a) {print i,a[i]}}'|sort -n I also see the same distribution (albeit now positive and negative halves). This is a very crude check to see if the problem lies in the ISIZE values produce by fixmate (which are directly a consequence of the alignments themselves) or in the stats generation code. James [1] #!/usr/bin/perl -w # Create sequence srand(0); my @a = qw/A C G T/; my $len = 100000; my $seq; for (my $i=0; $i<$len; $i++) { $seq .= $a[int(rand()*4)]; } open(my $ref, ">ref.fa") || die; print $ref ">ref\n$seq\n"; close($ref); # Pick pairs of seqs 100 base long, with flat insert size between 50 and 250 for (my $i = 0; $i < 10000; $i++) { my $pos = int(rand()*($len-500*2))+500; my $isize = int(rand()*200)+50; print ">seq_$i\n",substr($seq,$pos,100),"\n"; my $s = reverse(substr($seq,$pos+$isize-100,100)); $s =~ tr/ACGT/TGCA/; print ">seq_$i\n$s\n"; } > > > What might cause this phenomenon, and is there a way to avoid it in case I > made a mistake? The alignment was generated using bwa bwasw and subsequently > processed using samtools fixmate and samtools sort. I were glad to understand > this puzzling effect. > > > Frank > > > > > > > Dr. rer. nat. Frank Vorh?lter > Molekularbiologe > > ?ber?rtliche Berufsaus?bungsgemeinschaft > Medizinisches Versorgungszentrum > Dr. Eberhard & Partner Dortmund > MVZ-Haus 1: Brauhausstr. 4 > 44137 Dortmund, Germany > > > Tel.: +49 231 9572 6612 > > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > Samtools-help mailing list > Samtools-help@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/samtools-help -- James Bonfield (j...@sanger.ac.uk) The Sanger Institute, Hinxton, Cambs, CB10 1SA -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help