Hi James, thank you for a detailed analysis. James Bonfield wrote: > On Thu, Jul 28, 2016 at 10:53:53AM +0200, Martin MOKREJ? wrote: >> $ samtools view dedup.realigned.calmd.bam | grep >> HWI-xxxxx:xxx:xxxxxxxxx:1:1106:17285:41358 >> HWI-xxxxx:xxx:xxxxxxxxx:1:1106:17285:41358 145 1 9932 0 >> 74M 20 59443385 0 >> GGGGGGTGGGGGTCAGAGCACCACAGGGGGGGGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAACC >> >> ########################################################)))))))))))))))*)* >> >> BD:Z:MMMMSRQPPOOOONNNMMMMMMMMMMMMNNNNNNMMMMMMLLLLKKKKKKKKKKKKJJJJKKKKKKKMKFNNMM >> PG:Z:MarkDuplicates RG:Z:some-sample1 >> BI:Z:FFFFCA@@AAAABCDEFGHIJJKKLLLLLLLLLMMMMMNNNNMMMMMMMLLLLLLLKKKKKKKKKKKLLHHGFF >> MQ:i:96 PQ:i:187 UQ:i:38 XQ:i:0 >> ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@TUTRVSSUTWVUHB@H@Q >> NM:i:70 >> MD:Z:0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0T4 > > This looks correct to me. > >> My questions are: >> >> Is the message 'different NM for read' because I picked up a wrong >> reference file than the one used initially by bwa aligner? > > Either that or a bug in the aligner itself. I've seen some very odd > calculations of MD/NM from some aligners, eg ones that treat N as A > for hash purposes, etc. I don't know how bwa handles N, but it may
I heard that as well from Heng Li on the bio-bwa-help list but cannot find easily the particular email thread. But manual page also states that: NOTES ON SHORT-READ ALIGNMENT Alignment Accuracy ... When gapped alignment is disabled, BWA is expected to generate the same alignment as Eland version 1, the Illumina alignment program. However, as BWA change `N' in the database sequence to random nucleotides, hits to these random sequences will also be counted. As a consequence, BWA may mark a unique hit as a repeat, if the random sequences happen to be identical to the sequences which should be unique in the database. By default, if the best hit is not highly repetitive (controlled by -R), BWA also finds all hits contains one more mismatch; otherwise, BWA finds all equally best hits only. Base quality is NOT considered in evaluating hits. In the paired-end mode, BWA pairs all hits it found. It further performs Smith-Waterman alignment for unmapped reads to rescue reads with a high error rate, and for high-quality anomalous pairs to fix potential alignment errors. > have also changed over time so you'd need to check the @PG headers to > see the release number too. > >> In my eyes >> MD:Z:0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0T4 >> means that now the reference contains N's while it did not have them in >> the past. > > Correct. > > $ samtools faidx hs37d5.fa 1:9932-10005 >> 1:9932-10005 > NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN > NNNNNNNNNTAACC > > calmd is claiming all bar the last 4 bases differ and should be > N{69}T, with the last AACC matching. > >> If the @MD tag is correct it means the read should not be >> considered as mapped to the chromosome 1 at all, and theoretically >> could have a better match elsewhere. So, keeping it badly aligned in >> this locus is wrong I think. So should I try to re-align all reads >> mentioned on the 'different NM for read' line again? How? Remove >> them from BAM, mapping newly while creating a new BAM file, then >> merging the BAM files? Sound I would loose some of the optional >> columns so the merged BAM file would not have all columns for all >> reads. Is that a problem? > > Again, this depends on the aligner and options used. I'd agree though > that it looks like a "courageous" alignment, but possibly it's the > best (maybe even correct) alignment that places it the expected > distance apart from a uniquely placed mate pair. So I conclude bwa mem placed it there maybe thank to its other mate read, and maybe also thanks to a randomly rewritten sequence at first. Probably the randomly rewritten sequence should not or is not already considered if bwa mem know the approximate position and sequence thank to the other mate. I think these read counts just increase the number of mapped, maybe even of properly mapped read in the final statistics. This particular read received '74M' instead of '*'. Am I right that only thanks to the MD: tag I could get rid of it? Based on that the other mate could be made more accessible to a re-alignment by some other tool in turn (probably no tool will try to fiddle with properly placed pair, right?). > > At least it got a very low mapping quality. So do you think generating MD: tag is necessary to get rid of these reads deemed to be aligned, or do you think the low mapping quality is enough? > > Realigning is a little messy. You'd have to come up with a way to > filter out the specific read pairs you want to realign (maybe even > something as dumb as an egrep if there are just a few), align and sort > them, and then merge your filtered and newly aligned files back again. > > (I don't have time to construct such a command line.) Don't worry, but thank you for detailed answers. Here is a simpler case which had one change (look for the second mate in the pair, it contains one N): [bam_fillmd1] different NM for read 'HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038': 0 -> 1 $ samtools view mysample-PB.bam | grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 99 chr17 19712705 60 74M = 19712712 81 GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA <=9?4?@<<?;;7=;<<=7?7877>:88@7?@=>>@6?*=?@0@B6:B<C>B?B=C==CA1<>=8:CC==<=@7 PG:Z:MarkDuplicates RG:Z:mysample-PB NM:i:0 SM:i:0 MQ:i:0 PQ:i:0 UQ:i:0 XQ:i:0 HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 147 chr17 19712712 60 74M = 19712705 -81 TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG 8B>877A99;8=?9>78>88=?=BC7A<4!981?C6B?88@=A=@@><A9<AA=A=?@<@C7?5@><???4>7< PG:Z:MarkDuplicates RG:Z:mysample-PB NM:i:0 SM:i:0 MQ:i:0 PQ:i:0 UQ:i:0 XQ:i:0 $ $ samtools view mysample-PB.realignedtogether.BQSR.namesorted.fixmate.bam | grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 99 chr17 19712705 60 74M = 19712712 81 GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA >?A?9CACCG679F7<<B9E6976C>99E7ECCGGE:E-CEB0ED;;G;GBGFE?G?BFB1:G;8:HBBEAB<< MC:Z:74M BD:Z:MMNNSQQQQQPOOPPNFFOPPPNMPMPNOPOOOPPMPOPNNPOPPPNNNNMNOLNONNNQPOPNOQQSRRPPPP PG:Z:MarkDuplicates RG:Z:mysample-PB BI:Z:DDCFKMMLMMKKMMJKKKMMKLLINKLKNLMMKLLLIKIGJKIJKGFJGIFHGGDEBBDB@?>=><>>>BBEFC NM:i:0 SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60 HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 147 chr17 19712712 60 74M = 19712705 -81 TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG 7:=<73E>D?:<F8@39D::AE<EG3DA4!9:3GG4DF:;EAFACCABD8BDF=E:FEBDF7E5EHAED@2>6= MC:Z:74M BD:Z:PONNSSPHHPPQQPNOMPMNOPOPQPMMMMOOPPQPPONNNNNOMONNNNPPNNMLNMOPPONOOOOPRRMMMM PG:Z:MarkDuplicates RG:Z:mysample-PB BI:Z:CEED@<><<<=;?@??BDEAFGFHJGJIIIJIHKLILLKKMLMLNNNMNMMNNLNMNNMMOLNLNNLMMNCEDD NM:i:0 SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60 $ $ samtools view mysample-PB.realignedtogether.BQSR.namesorted.fixmate.calmd.bam | grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated. HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 99 chr17 19712705 60 74M = 19712712 81 GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA >?A?9CACCG679F7<<B9E6976C>99E7ECCGGE:E-CEB0ED;;G;GBGFE?G?BFB1:G;8:HBBEAB<< MC:Z:74M BD:Z:MMNNSQQQQQPOOPPNFFOPPPNMPMPNOPOOOPPMPOPNNPOPPPNNNNMNOLNONNNQPOPNOQQSRRPPPP PG:Z:MarkDuplicates RG:Z:mysample-PB BI:Z:DDCFKMMLMMKKMMJKKKMMKLLINKLKNLMMKLLLIKIGJKIJKGFJGIFHGGDEBBDB@?>=><>>>BBEFC NM:i:0 SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60 ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ MD:Z:74 HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 147 chr17 19712712 60 74M = 19712705 -81 TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG 7:=<73E>D?:<F8@39D::AE<EG3DA4!9:3GG4DF:;EAFACCABD8BDF=E:FEBDF7E5EHAED@2>6= MC:Z:74M BD:Z:PONNSSPHHPPQQPNOMPMNOPOPQPMMMMOOPPQPPONNNNNOMONNNNPPNNMLNMOPPONOOOOPRRMMMM PG:Z:MarkDuplicates RG:Z:mysample-PB BI:Z:CEED@<><<<=;?@??BDEAFGFHJGJIIIJIHKLILLKKMLMLNNNMNMMNNLNMNNMMOLNLNNLMMNCEDD SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60 ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ NM:i:1 MD:Z:29T44 $ So, there is one 'N' which was replaced by bwa during alignment randomly and maybe it was incidentally an exact match (25% probab.). Could samtools calmd apply the following logic for bwa-processed input? Get positions of all N's in the read. Do not complain about those positions which are based on N's. Do report other positions. OK, I know "samtools calmd" reports a total sum of the differences before and after but quite likely there will be only a "0 -> 0" to be reported. So I will get less warning on the screen, which is always good. ;) bwa-0.7.13-r1126, samtools-1.3 (using htslib 1.3). Thank you, Martin ------------------------------------------------------------------------------ _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help