On Sun, 12 Aug 2018, Nowoshilow,Sergej wrote:
Dear SAMtools community,
I am trying to view the contents of a SAM file using “samtools view”, but
unfortunately it fails with the following error message: [W::sam_parse1]
urecognized reference name; treated as unmapped
I need to use SAMtools to calculate the statistics and subsequently convert the
SAM file to BAM.
My first guess was that the chromosome name is not present in the header for
whatever reason. However, if I open the SAM file with the text viewer (e.g.
less), I see that all chromosomes are there:
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:2955898172
@SQ SN:chr2 LN:2923849100
@SQ SN:chr3 LN:2495613285
@SQ SN:chr4 LN:2453368029
@SQ SN:chr5 LN:2630329465
@SQ SN:chr6 LN:3136375085
@SQ SN:chr7 LN:2029504283
@SQ SN:chr8 LN:1710882196
@SQ SN:chr9 LN:1495491496
@SQ SN:chr10 LN:1639492812
@SQ SN:chr11 LN:1436510532
@SQ SN:chr12 LN:1210869585
@SQ SN:chr13 LN:718908455
@SQ SN:chr14 LN:657912494
I assume that the error occurs because the first 6 chromosomes are longer than
2^31 bp. Those are the ones that generate the error messages in the
“samtools view” output. Chromosomes 7-14 are present in the output.
Is this assumption correct?

Yes, that's most likely the cause.
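
For anyone wanting to check their own header, here is a minimal sketch that lists the @SQ entries whose LN value does not fit in a signed 32-bit integer. The function name and the whitespace-based parsing are illustrative assumptions, not part of samtools; the header lines are the ones quoted above.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def oversized_references(header_lines, limit=INT32_MAX):
    """Return (name, length) for each @SQ line whose LN exceeds `limit`.

    Hypothetical helper: splits on whitespace, so it accepts both real
    tab-separated headers and the space-separated copy pasted above.
    """
    found = []
    for line in header_lines:
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        length = int(tags["LN"])
        if length > limit:
            found.append((tags["SN"], length))
    return found


header = [
    "@HD VN:1.0 SO:unsorted",
    "@SQ SN:chr1 LN:2955898172",
    "@SQ SN:chr2 LN:2923849100",
    "@SQ SN:chr3 LN:2495613285",
    "@SQ SN:chr4 LN:2453368029",
    "@SQ SN:chr5 LN:2630329465",
    "@SQ SN:chr6 LN:3136375085",
    "@SQ SN:chr7 LN:2029504283",
    "@SQ SN:chr8 LN:1710882196",
    "@SQ SN:chr9 LN:1495491496",
    "@SQ SN:chr10 LN:1639492812",
    "@SQ SN:chr11 LN:1436510532",
    "@SQ SN:chr12 LN:1210869585",
    "@SQ SN:chr13 LN:718908455",
    "@SQ SN:chr14 LN:657912494",
]

too_long = oversized_references(header)
for name, length in too_long:
    print(name, length)
```

With the header from this thread, the entries reported are chr1 to chr6, which matches the references that trigger the warning.
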
Is there an easy fix for this issue? I would prefer not to split the
chromosomes into p and q arms at this point. However, I might have to do so if
there is no other way.

No, it's not easy to fix as the assumption that a chromosome position can
be stored in a 32-bit integer is embedded all the way through the library,
and also in both the BAM and CRAM file formats. So I'm afraid converting
to BAM is impossible for this data with the current format.
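
To illustrate the 32-bit limit, here is a small sketch using Python's struct module. It only shows whether a value can be packed as a little-endian signed 32-bit integer; it is not the htslib code path, and the helper name is made up for this example.

```python
import struct


def fits_int32(value):
    """Return True if `value` can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)  # "<i" = little-endian signed 32-bit
        return True
    except struct.error:
        return False


# Lengths taken from the header quoted earlier in this thread.
print("chr1:", fits_int32(2955898172))
print("chr7:", fits_int32(2029504283))
```

Any position beyond 2^31 - 1 on chr1 to chr6 hits the same problem as the reference length itself.
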
Having said that, there is a work in progress to allow for long
chromosomes. If you want to try it, take a look at pull request 709:
https://github.com/samtools/htslib/pull/709
I've taken the liberty of rebasing it, so it is possible to build a copy
of samtools develop that uses it. Amazingly it mostly works - fixmate
seems to be the only breakage when running its test harness.
It will currently only allow references up to about 4 Gbases. More updates
will be needed to go higher than that.
This patch is very experimental. It will only work for SAM files - trying
to make a BAM or CRAM will either crash or (more likely) create a broken
file with incorrect positions in it. The SAM files may not be readable by
other tools, depending on how they store the positions. It's likely that
more updates will be needed to all of htslib, samtools and bcftools before
this all actually works properly.
So, if you want something reliable that works now, you probably need to
split the chromosomes. But if you're feeling adventurous then you might
want to try the pull request to see if it works for what you need.
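
If you do go the splitting route, the coordinate bookkeeping could look roughly like the sketch below. Everything in it is an assumption for illustration: the "p"/"q" suffixes, the function name, and especially the breakpoint, which is an arbitrary number and not a real centromere position. Positions are 1-based, as in SAM.

```python
INT32_MAX = 2**31 - 1

# Hypothetical split point (last base of the first part); choose real
# values for your assembly so that both parts stay below INT32_MAX.
BREAKPOINTS = {"chr1": 1_500_000_000}


def to_split_coords(chrom, pos, breakpoints=BREAKPOINTS):
    """Map a 1-based position on `chrom` to (part_name, local_position)."""
    if chrom not in breakpoints:
        return (chrom, pos)  # chromosome was not split
    breakpoint = breakpoints[chrom]
    if pos <= breakpoint:
        return (chrom + "p", pos)
    return (chrom + "q", pos - breakpoint)


chr1_length = 2955898172  # LN of chr1 from the header in this thread
part_lengths = (BREAKPOINTS["chr1"], chr1_length - BREAKPOINTS["chr1"])
print(part_lengths, all(n <= INT32_MAX for n in part_lengths))
print(to_split_coords("chr1", 2_000_000_000))
```

The reverse mapping (adding the breakpoint back for the second part) is what you would need to report results in the original coordinates.
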
Rob Davies r...@sanger.ac.uk
The Sanger Institute http://www.sanger.ac.uk/
Hinxton, Cambs., Tel. +44 (1223) 834244
CB10 1SA, U.K. Fax. +44 (1223) 494919