On Sun, 12 Aug 2018, Nowoshilow,Sergej wrote:

Dear SAMtools community,

I am trying to view the contents of a SAM file using “samtools view”, but 
unfortunately it fails with the following error message:

[W::sam_parse1] urecognized reference name; treated as unmapped

I need to use SAMtools to calculate statistics and subsequently convert the 
SAM file to BAM.

My first guess was that the chromosome names were missing from the header for 
some reason. However, if I open the SAM file in a text viewer (e.g. 
less), I see that all chromosomes are there:

@HD     VN:1.0  SO:unsorted
@SQ     SN:chr1 LN:2955898172
@SQ     SN:chr2 LN:2923849100
@SQ     SN:chr3 LN:2495613285
@SQ     SN:chr4 LN:2453368029
@SQ     SN:chr5 LN:2630329465
@SQ     SN:chr6 LN:3136375085
@SQ     SN:chr7 LN:2029504283
@SQ     SN:chr8 LN:1710882196
@SQ     SN:chr9 LN:1495491496
@SQ     SN:chr10        LN:1639492812
@SQ     SN:chr11        LN:1436510532
@SQ     SN:chr12        LN:1210869585
@SQ     SN:chr13        LN:718908455
@SQ     SN:chr14        LN:657912494

I assume that the error occurs because the first 6 chromosomes are longer 
than 2^31 bp. Those are the ones that generate the error messages in 
the “samtools view” output. Chromosomes 7-14 are present in the output.
Is this assumption correct?

Yes, that's most likely the cause.
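
To check a header for this quickly, something like the following Python sketch should do. It is only an illustration, not part of samtools: the function name is made up, and it assumes the @SQ fields are tab-separated as in a normal SAM header (the tabs show up as spaces in the listing above).

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value


def oversized_references(header_lines, limit=INT32_MAX):
    """Return (name, length) for each @SQ line whose LN exceeds `limit`."""
    hits = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        length = int(fields["LN"])
        if length > limit:
            hits.append((fields["SN"], length))
    return hits


# A few lines from the header quoted above (tabs restored).
header = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:2955898172",
    "@SQ\tSN:chr6\tLN:3136375085",
    "@SQ\tSN:chr7\tLN:2029504283",
    "@SQ\tSN:chr14\tLN:657912494",
]

for name, length in oversized_references(header):
    print(f"{name}\t{length}\texceeds {INT32_MAX}")
```

With the sample lines above this reports chr1 and chr6 but not chr7 or chr14, which matches the pattern of warnings described.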

Is there an easy fix for this issue? I would prefer not to split the 
chromosomes into p and q arms at this point. However, I might have to do so if 
there is no other way.

No, it's not easy to fix: the assumption that a chromosome position can be stored in a 32-bit integer is embedded all the way through the library, and also in both the BAM and CRAM file formats. So I'm afraid converting this data to BAM is impossible with the current format.
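
As a rough illustration of the format limit (not htslib code): as far as I read the BAM specification, the position of each record is a little-endian signed 32-bit field. The Python sketch below uses the standard `struct` module to show that the last base of chr7 fits in such a field while the last base of chr1 does not, and what value comes out if the number is simply truncated to 32 bits. The helper names are made up for the example.

```python
import struct


def fits_int32(pos):
    """Return True if `pos` can be packed as a little-endian signed 32-bit int."""
    try:
        struct.pack("<i", pos)
        return True
    except struct.error:
        return False


def wrapped_int32(pos):
    """Show the value a reader would see if `pos` were truncated to 32 bits."""
    return struct.unpack("<i", struct.pack("<I", pos & 0xFFFFFFFF))[0]


last_base_chr7 = 2029504283 - 1  # 0-based position of the last base of chr7
last_base_chr1 = 2955898172 - 1  # same for chr1

print(fits_int32(last_base_chr7))     # True
print(fits_int32(last_base_chr1))     # False
print(wrapped_int32(last_base_chr1))  # -1339069125
```

A negative or otherwise wrong position like the last one is the kind of damage to expect if oversized coordinates are forced into the existing binary layout.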

Having said that, there is work in progress to allow for long chromosomes. If you want to try it, take a look at pull request 709:
https://github.com/samtools/htslib/pull/709

I've taken the liberty of rebasing it, so it is possible to build a copy of the samtools develop branch that uses it. Amazingly, it mostly works - fixmate seems to be the only breakage when running its test harness.

It currently only allows references up to about 4 Gbases. More updates will be needed to go higher than that.

This patch is very experimental. It will only work for SAM files - trying to make a BAM or CRAM will either crash or (more likely) create a broken file with incorrect positions in it. The SAM files may not be readable by other tools, depending on how they store the positions. It's likely that more updates will be needed to all of htslib, samtools and bcftools before this all actually works properly.

So, if you want something reliable that works now, you probably need to split the chromosomes. But if you're feeling adventurous then you might want to try the pull request to see if it works for what you need.
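
If you do go down the splitting route, the coordinate bookkeeping could look something like the Python sketch below. This is an assumption-laden example rather than a recommended procedure: it cuts each chromosome into the fewest equal-sized pieces that fit under 2^31 - 1 bases instead of cutting at the real p/q arm boundary, and the `_partN` names are invented. Reads that span a cut point would still need separate handling.

```python
INT32_MAX = 2**31 - 1


def split_plan(name, length, max_len=INT32_MAX):
    """Split [0, length) into the fewest near-equal pieces of at most max_len bases.

    Returns a list of (piece_name, start, end) with 0-based, half-open coordinates.
    """
    n_pieces = -(-length // max_len)   # ceiling division
    piece_len = -(-length // n_pieces)
    return [
        (f"{name}_part{i + 1}", start, min(start + piece_len, length))
        for i, start in enumerate(range(0, length, piece_len))
    ]


def to_split(plan, pos):
    """Map a 0-based position on the original chromosome to (piece_name, pos)."""
    for piece_name, start, end in plan:
        if start <= pos < end:
            return piece_name, pos - start
    raise ValueError(f"position {pos} is outside the chromosome")


plan = split_plan("chr1", 2955898172)
for piece in plan:
    print(piece)
print(to_split(plan, 2955898171))  # last base of the original chr1
```

Keeping the plan around makes it possible to translate results back to whole-chromosome coordinates later by adding the piece's start offset.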

Rob Davies              r...@sanger.ac.uk
The Sanger Institute    http://www.sanger.ac.uk/
Hinxton, Cambs.,        Tel. +44 (1223) 834244
CB10 1SA, U.K.          Fax. +44 (1223) 494919


