Re: [Samtools-help] samtools calmd behavior and MD tag

Martin MOKREJŠ Tue, 02 Aug 2016 02:59:22 -0700

Hi James,
   thank you for a detailed analysis.

James Bonfield wrote:
> On Thu, Jul 28, 2016 at 10:53:53AM +0200, Martin MOKREJ? wrote:
>> $ samtools view dedup.realigned.calmd.bam | grep 
>> HWI-xxxxx:xxx:xxxxxxxxx:1:1106:17285:41358
>> HWI-xxxxx:xxx:xxxxxxxxx:1:1106:17285:41358      145     1       9932    0    
>>    74M     20      59443385        0       
>> GGGGGGTGGGGGTCAGAGCACCACAGGGGGGGGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAACC   
>>    
>> ########################################################)))))))))))))))*)*   
>>    
>> BD:Z:MMMMSRQPPOOOONNNMMMMMMMMMMMMNNNNNNMMMMMMLLLLKKKKKKKKKKKKJJJJKKKKKKKMKFNNMM
>>         PG:Z:MarkDuplicates     RG:Z:some-sample1   
>> BI:Z:FFFFCA@@AAAABCDEFGHIJJKKLLLLLLLLLMMMMMNNNNMMMMMMMLLLLLLLKKKKKKKKKKKLLHHGFF
>>  MQ:i:96 PQ:i:187        UQ:i:38 XQ:i:0  
>> ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@TUTRVSSUTWVUHB@H@Q
>>  NM:i:70 
>> MD:Z:0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0T4
>
> This looks correct to me.
>
>> My questions are:
>>
>>    Is the message 'different NM for read' because I picked up a wrong 
>> reference file than the one used initially by bwa aligner?
>
> Either that or a bug in the aligner itself.  I've seen some very odd
> calculations of MD/NM from some aligners, eg ones that treat N as A
> for hash purposes, etc.  I don't know how bwa handles N, but it may


I heard that as well from Heng Li on the bio-bwa-help list but cannot
find easily the particular email thread.
But manual page also states that:


NOTES ON SHORT-READ ALIGNMENT
    Alignment Accuracy
        ...

        When  gapped  alignment  is disabled, BWA is expected to generate the 
same alignment as Eland version 1, the Illumina alignment program. However, as 
BWA change `N' in the database sequence to random nucleotides, hits to these 
random sequences will also be counted. As a
        consequence, BWA may mark a unique hit as a repeat, if the random 
sequences happen to be identical to the sequences which should be unique in the 
database.

        By default, if the best hit is not highly repetitive (controlled by 
-R), BWA also finds all hits contains one more mismatch; otherwise, BWA finds 
all equally best hits only. Base quality is NOT considered in evaluating hits. 
In the paired-end mode, BWA pairs  all  hits
        it found. It further performs Smith-Waterman alignment for unmapped 
reads to rescue reads with a high error rate, and for high-quality anomalous 
pairs to fix potential alignment errors.



> have also changed over time so you'd need to check the @PG headers to
> see the release number too.
>
>>    In my eyes 
>> MD:Z:0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0T4
>>   means that now the reference contains N's while it did not have them in 
>> the past.
>
> Correct.
>
> $ samtools faidx hs37d5.fa 1:9932-10005
>> 1:9932-10005
> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> NNNNNNNNNTAACC
>
> calmd is claiming all bar the last 4 bases differ and should be
> N{69}T, with the last AACC matching.
>
>>    If the @MD tag is correct it means the read should not be
>> considered as mapped to the chromosome 1 at all, and theoretically
>> could have a better match elsewhere. So, keeping it badly aligned in
>> this locus is wrong I think. So should I try to re-align all reads
>> mentioned on the 'different NM for read' line again? How? Remove
>> them from BAM, mapping newly while creating a new BAM file, then
>> merging the BAM files? Sound I would loose some of the optional
>> columns so the merged BAM file would not have all columns for all
>> reads. Is that a problem?
>
> Again, this depends on the aligner and options used.  I'd agree though
> that it looks like a "courageous" alignment, but possibly it's the
> best (maybe even correct) alignment that places it the expected
> distance apart from a uniquely placed mate pair.

So I conclude bwa mem placed it there maybe thank to its other mate read,
and maybe also thanks to a randomly rewritten sequence at first.
Probably the randomly rewritten sequence should not or is not already
considered if bwa mem know the approximate position and sequence thank
to the other mate. I think these read counts just increase the number of
mapped, maybe even of properly mapped read in the final statistics.
This particular read received '74M' instead of '*'. Am I right that only
thanks to the MD: tag I could get rid of it? Based on that the other mate
could be made more accessible to a re-alignment by some other tool in turn
(probably no tool will try to fiddle with properly placed pair, right?).

>
> At least it got a very low mapping quality.

So do you think generating MD: tag is necessary to get rid of these reads
deemed to be aligned, or do you think the low mapping quality is enough?

>
> Realigning is a little messy.  You'd have to come up with a way to
> filter out the specific read pairs you want to realign (maybe even
> something as dumb as an egrep if there are just a few), align and sort
> them, and then merge your filtered and newly aligned files back again.
>
> (I don't have time to construct such a command line.)

Don't worry, but thank you for detailed answers.




Here is a simpler case which had one change (look for the second mate in the 
pair, it contains one N):

[bam_fillmd1] different NM for read 
'HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038': 0 -> 1

$ samtools view mysample-PB.bam | grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038       99      chr17   19712705        
60      74M     =       19712712        81      
GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA      
<=9?4?@<<?;;7=;<<=7?7877>:88@7?@=>>@6?*=?@0@B6:B<C>B?B=C==CA1<>=8:CC==<=@7      
  PG:Z:MarkDuplicates     RG:Z:mysample-PB   NM:i:0  SM:i:0  MQ:i:0  PQ:i:0  
UQ:i:0  XQ:i:0
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038       147     chr17   19712712        
60      74M     =       19712705        -81     
TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG      
8B>877A99;8=?9>78>88=?=BC7A<4!981?C6B?88@=A=@@><A9<AA=A=?@<@C7?5@><???4>7<      
  PG:Z:MarkDuplicates     RG:Z:mysample-PB   NM:i:0  SM:i:0  MQ:i:0  PQ:i:0  
UQ:i:0  XQ:i:0
$

$ samtools view mysample-PB.realignedtogether.BQSR.namesorted.fixmate.bam | 
grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038       99      chr17   19712705        
60      74M     =       19712712        81      
GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA      
>?A?9CACCG679F7<<B9E6976C>99E7ECCGGE:E-CEB0ED;;G;GBGFE?G?BFB1:G;8:HBBEAB<<      
  MC:Z:74M        
BD:Z:MMNNSQQQQQPOOPPNFFOPPPNMPMPNOPOOOPPMPOPNNPOPPPNNNNMNOLNONNNQPOPNOQQSRRPPPP 
PG:Z:MarkDuplicates     RG:Z:mysample-PB   
BI:Z:DDCFKMMLMMKKMMJKKKMMKLLINKLKNLMMKLLLIKIGJKIJKGFJGIFHGGDEBBDB@?>=><>>>BBEFC 
NM:i:0    SM:i:0  PQ:i:0  UQ:i:0  XQ:i:0  MQ:i:60
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038       147     chr17   19712712        
60      74M     =       19712705        -81     
TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG      
7:=<73E>D?:<F8@39D::AE<EG3DA4!9:3GG4DF:;EAFACCABD8BDF=E:FEBDF7E5EHAED@2>6=      
  MC:Z:74M        
BD:Z:PONNSSPHHPPQQPNOMPMNOPOPQPMMMMOOPPQPPONNNNNOMONNNNPPNNMLNMOPPONOOOOPRRMMMM 
PG:Z:MarkDuplicates     RG:Z:mysample-PB   
BI:Z:CEED@<><<<=;?@??BDEAFGFHJGJIIIJIHKLILLKKMLMLNNNMNMMNNLNMNNMMOLNLNNLMMNCEDD 
NM:i:0    SM:i:0  PQ:i:0  UQ:i:0  XQ:i:0  MQ:i:60
$

$ samtools view mysample-PB.realignedtogether.BQSR.namesorted.fixmate.calmd.bam 
| grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038       99      chr17   19712705        
60      74M     =       19712712        81      
GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA      
>?A?9CACCG679F7<<B9E6976C>99E7ECCGGE:E-CEB0ED;;G;GBGFE?G?BFB1:G;8:HBBEAB<<      
  MC:Z:74M        
BD:Z:MMNNSQQQQQPOOPPNFFOPPPNMPMPNOPOOOPPMPOPNNPOPPPNNNNMNOLNONNNQPOPNOQQSRRPPPP 
PG:Z:MarkDuplicates     RG:Z:mysample-PB   
BI:Z:DDCFKMMLMMKKMMJKKKMMKLLINKLKNLMMKLLLIKIGJKIJKGFJGIFHGGDEBBDB@?>=><>>>BBEFC 
NM:i:0    SM:i:0  PQ:i:0  UQ:i:0  XQ:i:0  MQ:i:60 
ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 
MD:Z:74
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038       147     chr17   19712712        
60      74M     =       19712705        -81     
TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG      
7:=<73E>D?:<F8@39D::AE<EG3DA4!9:3GG4DF:;EAFACCABD8BDF=E:FEBDF7E5EHAED@2>6=      
  MC:Z:74M        
BD:Z:PONNSSPHHPPQQPNOMPMNOPOPQPMMMMOOPPQPPONNNNNOMONNNNPPNNMLNMOPPONOOOOPRRMMMM 
PG:Z:MarkDuplicates     RG:Z:mysample-PB   
BI:Z:CEED@<><<<=;?@??BDEAFGFHJGJIIIJIHKLILLKKMLMLNNNMNMMNNLNMNNMMOLNLNNLMMNCEDD 
SM:i:0    PQ:i:0  UQ:i:0  XQ:i:0  MQ:i:60 
ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 
NM:i:1  MD:Z:29T44
$

So, there is one 'N' which was replaced by bwa during alignment randomly and 
maybe it was incidentally an exact match (25% probab.).




Could samtools calmd apply the following logic for bwa-processed input?
  Get positions of all N's in the read.
  Do not complain about those positions which are based on N's.
  Do report other positions.

OK, I know "samtools calmd" reports a total sum of the differences before and 
after but quite likely there will be only a "0 -> 0" to be reported. So I will 
get less warning on the screen, which is always good. ;)

bwa-0.7.13-r1126, samtools-1.3 (using htslib 1.3).


Thank you,
Martin

------------------------------------------------------------------------------
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Re: [Samtools-help] samtools calmd behavior and MD tag

Reply via email to