Hi James,

Thanks for your reply. I used UCSC hg19 from Illumina iGenomes. Is the value 
8192 abnormally large for NM tag? This is a BAM from RNA-seq, so there is an 
intron of 8192 nt. How does NM compute? Does NM consider CIGAR "N" into 
account?  Thanks.

One example read in the Initial BAM
DF9F08P1:388:C737JACXX:1:1207:17084:11061       163     chr1    10412727        
255     68M8192N8M      =       10412789        138     
GTTGATGCCATCCTCTCCCTAAATATTATTTCTGCCAAGTACCTGAAGTCTTCCCACAACTCTAGCAGGACCTTCT    
CCCFFFFFHHHHGJJJJJJIJIJJIJIJJJJJJJJJJJJHHIJJJIIJGIBGIIJJJJJJJJIJJJJIJJJJIJJH    
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:76 YS:i:0  YT:Z:CP XS:A:+  
NH:i:1
The same read after conversion:
DF9F08P1:388:C737JACXX:1:1207:17084:11061       163     chr1    10412727        
255     68M8192N8M      =       10412789        138     
GTTGATGCCATCCTCTCCCTAAATATTATTTCTGCCAAGTACCTGAAGTCTTCCCACAACTCTAGCAGGACCTTCT    
CCCFFFFFHHHHGJJJJJJIJIJJIJIJJJJJJJJJJJJHHIJJJIIJGIBGIIJJJJJJJJIJJJJIJJJJIJJH    
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  YS:i:0  YT:Z:CP XS:A:+  NH:i:1  MD:Z:76 
NM:i:8192

Initial BAM header:
@HD     VN:1.0  SO:coordinate
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@PG     ID:hisat2       PN:hisat2       VN:2.0.1-beta

header of BAM in the end:
@HD     VN:1.0  SO:coordinate
@SQ     SN:chrM LN:16571        M5:d2ed829b8a1628d16cbeee88e88e39eb     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr1 LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr2 LN:243199373    M5:a0d9851da00400dec1098a9255ac712e     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr3 LN:198022430    M5:641e4338fa8d52a5b781bd2a2c08d3c3     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr4 LN:191154276    M5:23dccd106897542ad87d2765d28a19a1     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr5 LN:180915260    M5:0740173db9ffd264d728f32784845cd7     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr6 LN:171115067    M5:1d3a93a248d92a729ee764823acbbc6b     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr7 LN:159138663    M5:618366e953d6aaad97dbe4777c29375e     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr8 LN:146364022    M5:96f514a9929e410c6651697bded59aec     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr9 LN:141213431    M5:3e273117f15e0a400f01055d9f393768     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr10        LN:135534747    M5:988c28e000e84c26d552359af1ea2e1d     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr11        LN:135006516    M5:98c59049a2df285c76ffb1c6db8f8b96     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr12        LN:133851895    M5:51851ac0e1a115847ad36449b0015864     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr13        LN:115169878    M5:283f8d7892baa81b510a015719ca7b0b     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr14        LN:107349540    M5:98f3cae32b2a2e9524bc19813927542e     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr15        LN:102531392    M5:e5645a794a8238215b2cd77acb95a078     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr16        LN:90354753     M5:fc9b1a7b42b97a864f56b348b06095e6     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr17        LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr18        LN:78077248     M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr19        LN:59128983     M5:1aacd71f30db8e561810913e0b72636d     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr20        LN:63025520     M5:0dec9660ec1efaaf33281c0d5ea2560f     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr21        LN:48129895     M5:2979a6085bfe28e3ad6f552f361ed74d     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chr22        LN:51304566     M5:a718acaa6135fdca8357d5bfe94211dd     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chrX LN:155270560    M5:7e0e2e580297b7764e31dbc80c2540dd     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@SQ     SN:chrY LN:59373566     M5:1e86411d73e6f00a10590f976be01623     
UR:/home/lin/genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
@PG     ID:hisat2       PN:hisat2       VN:2.0.1-beta

Chih-Hsu​
------------------------------------------------------------
​Chih-Hsu Lin
Grad student
Lichtarge Lab (T929)
SCBMB, Baylor College of Medicine
chihh...@bcm.edu
________________________________________
From: James Bonfield <j...@sanger.ac.uk>
Sent: Wednesday, May 18, 2016 3:25:17 AM
To: Lin, Chih-Hsu
Cc: samtools-help@lists.sourceforge.net
Subject: Re: [Samtools-help] About BAM->CRAM->BAM

On Tue, May 17, 2016 at 09:17:20PM +0000, Lin, Chih-Hsu wrote:
> During the conversion, BAM->CRAM->BAM, using samtools 1.3.1, I found the NM 
> tags were changed. Does anyone have solution to that?

Can you give us a concrete example please; an alignment record along
with the relevant @SQ line so we can see what the reference is.

The reason for NM and MD changes is that CRAM doesn't explicitly store
these (although it could, it leads to larger files).  Instead it uses
the reference to compute them on-the-fly.  However we have seen a
number of cases where NM/MD in the original BAM file are incorrect,
due to bugs in aligners.  This leads to changes after round-tripping
through CRAM.

So that said, are the NM values output by CRAM->BAM the same values
that samtools calmd generates?  If so then this is fixing your data!
If not then we possibly have a bug that needs fixing.

Some have asked for a way to store the invalid data in CRAM
regardless.  There is perhaps some (albeit twisted) logic to this as
it makes the validation of the data the responsibility of other tools
and not the file format itself.  I experimented with this by writing
out all NM/MD, but it usually leads to 5-10% growth.  A better
solution would be to check when they differ to the computed values and
only store them then, although that will slow up CRAM encoding
somewhat so I'm not convinced yet this is a problem in need of a
solution.

James

--
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.


--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.
------------------------------------------------------------------------------
Mobile security can be enabling, not merely restricting. Employees who
bring their own devices (BYOD) to work are irked by the imposition of MDM
restrictions. Mobile Device Manager Plus allows you to control only the
apps on BYO-devices by containerizing them, leaving personal data untouched!
https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to