Re: [Samtools-help] Extremely long reference sequences

Nowoshilow,Sergej Fri, 09 Mar 2018 15:57:30 -0800

Dear Thomas,

Thank you for the quick response and for the suggestion!
I created a clean test case file and tried both options: "samtools index" with 
the "-b" switch, which should generate a normal BAI file and, predictably, 
failed; and with the "-c -m 34" switches, which should have generated an index 
file with Bin 0 spanning 2^34 (~17 billion) bases. However, that also failed.

BAI:
samtools index test.sorted.bam 
[E::hts_idx_push] Region 536870821..536870933 cannot be stored in a bai index. 
Try using a csi index with min_shift = 14, n_lvls >= 6
samtools index: failed to create index for "test.sorted.bam": Numerical result 
out of range 

CSI:
samtools index -c -m 34 test.sorted.bam 
[E::hts_idx_push] Region 2671..2751 cannot be stored in a csi index with 
min_shift = 34, n_lvls = 10.  Try using  min_shift = 14, n_lvls >= 0
samtools index: failed to create index for "test.sorted.bam": Numerical result 
out of range

When I run the tool as suggested with min_shift = 14 I get another error 
message:
samtools index -c -m 14 test.sorted.bam 
[E::hts_idx_push] chromosome blocks not continuous
samtools index: failed to create index for "test.sorted.bam": No such file or 
directory
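
The limits quoted in these error messages are consistent with the binning 
scheme used by BAI and CSI indices, where the largest indexable coordinate is 
2^(min_shift + 3*n_lvls). The following is a minimal Python sketch of that 
arithmetic, not code taken from samtools; the function names are hypothetical. 
It assumes that BAI is fixed at min_shift = 14 and n_lvls = 5, and that "-m" 
sets min_shift (the smallest interval size) rather than the size of Bin 0.

```python
def max_indexable_length(min_shift: int, n_lvls: int) -> int:
    """Span covered by the top-level bin: 2^(min_shift + 3 * n_lvls)."""
    return 1 << (min_shift + 3 * n_lvls)


def levels_needed(ref_length: int, min_shift: int = 14) -> int:
    """Smallest n_lvls whose top-level bin spans ref_length bases."""
    n_lvls = 0
    while max_indexable_length(min_shift, n_lvls) < ref_length:
        n_lvls += 1
    return n_lvls


# Assumption: BAI corresponds to min_shift = 14 with 5 levels.
print(max_indexable_length(14, 5))    # 536870912 (2^29)

# Levels required for a 1.9Gbp scaffold when min_shift = 14.
print(levels_needed(1_900_000_000))   # 6

# With min_shift = 34 and n_lvls = 10 the exponent is 34 + 30 = 64,
# which no longer fits into a signed 64-bit integer.
print(max_indexable_length(34, 10) > 2**63 - 1)   # True
```

Under these assumptions the BAI limit is 2^29 = 536870912, which lies inside 
the rejected region 536870821..536870933, and a 1.9Gbp scaffold needs 
n_lvls >= 6 at min_shift = 14, matching the hint in the first message. Whether 
an integer overflow explains the failure with "-m 34" is only a guess.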

Apparently, the BAM file is not quite correctly sorted after all. A simple test 
samtools view test.sorted.bam | cut -f3 | uniq
proves that this is indeed the case: the scaffold IDs are sorted over ~90% 
of the file (e.g. from scaffold001 to scaffold100), while the last 10% are not 
sorted at all, e.g. scaffold052 may follow scaffold100 and so on. Therefore, 
the problem lies with "samtools sort" rather than "samtools index". I scanned 
through the source code hoping to find a variable that would be a good 
candidate for an overflow. However, they all seem to be at least of type INT, 
which, as a signed 32-bit integer, would be fine for positions up to 2^31-1 
(~2.1 billion), while the longest scaffold I have so far is 1.9Gbp.
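
The check above can be made a little more precise than eyeballing the output 
of "uniq". The following is a minimal Python sketch (the function name is 
hypothetical) that reads SAM text records, such as the output of "samtools 
view", and reports every reference that reappears after its block has ended, 
as well as positions that decrease within a block. It does not compare the 
reference order against the @SQ header; it only tests the contiguity that the 
"chromosome blocks not continuous" message complains about.

```python
def check_coordinate_sorted(sam_lines):
    """Return a list of ordering problems found in SAM text records."""
    problems = []
    finished = set()          # references whose block has already ended
    current, last_pos = None, 0
    for n, line in enumerate(sam_lines, 1):
        if line.startswith("@") or not line.strip():
            continue          # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        rname, pos = fields[2], int(fields[3])
        if rname != current:
            if rname in finished:
                problems.append(f"line {n}: {rname} reappears after {current}")
            if current is not None:
                finished.add(current)
            current = rname
        elif pos < last_pos:
            problems.append(f"line {n}: {rname} position {pos} < {last_pos}")
        last_pos = pos
    return problems


# Made-up records mimicking the situation described above.
example = [
    "r1\t0\tscaffold052\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tscaffold100\t200\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tscaffold052\t300\t60\t50M\t*\t0\t0\t*\t*",
]
for problem in check_coordinate_sorted(example):
    print(problem)
```

Passing sys.stdin instead of the example list lets the function read directly 
from a pipe fed by "samtools view".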

At the moment I'm stuck, but I will continue tomorrow. Please let me know if 
there is an obvious workaround for this issue. I may try to sort the reads 
myself, but I'm wondering whether I'm overlooking something.
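
Sorting the reads by hand could look roughly like the following Python sketch. 
It is an illustration under strong assumptions, not a replacement for 
"samtools sort": it works on SAM text, holds all records in memory (so it only 
suits small test files), takes the reference order from the @SQ lines, and 
does not update the sort-order tag in the @HD line. The function name is 
hypothetical.

```python
def sort_sam_by_coordinate(lines):
    """Return header lines followed by records sorted by (reference, position)."""
    header, records, order = [], [], {}
    for line in lines:
        if line.startswith("@"):
            header.append(line)
            if line.startswith("@SQ"):
                # Remember the order in which references are declared.
                for tag in line.rstrip("\n").split("\t")[1:]:
                    if tag.startswith("SN:"):
                        order[tag[3:]] = len(order)
        elif line.strip():
            records.append(line)

    def key(record):
        fields = record.split("\t")
        # References missing from the header (e.g. "*") are placed last.
        return (order.get(fields[2], len(order)), int(fields[3]))

    records.sort(key=key)     # stable: ties keep their input order
    return header + records
```

The result can be written back out and converted to BAM again before indexing.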

Thanks a lot!
Best regards
Sergej

Dr. Sergej Nowoshilow
Post-doc in Tanaka Lab

Elly Tanaka group
Animal models of regeneration
Campus-Vienna-Biocenter 1
1030 Vienna

email: sergej.nowoshi...@imp.ac.at
phone: +43 (0) 1 79730 3203

This message is confidential and may contain privileged information. It is 
intended for the named recipients only. If you receive it in error please 
notify me and permanently delete the original message and any copies.

On 09.03.18, 22:56, "Thomas W. Blackwell" <tbla...@umich.edu> wrote:

    Sergej  -

    I can't give you the details, but you should look at the SAM/BAM 
    Format document from www.htslib.org.  My old copy is dated 28 Dec 
    2014.  Go to page 15, the two-line paragraph just before 5.1.2. 
    This suggests using CSI index format rather than the default BAI 
    index format.  I think samtools supports both.  Let the list know if 
    this helps.

                                                -  tom blackwell  -

    On Fri, 9 Mar 2018, Nowoshilow,Sergej wrote:

    > Dear SAMtools developers and community
    >
    > Our group is working with the axolotl genome, which is 10x larger than
    > that of the human. It has 14 chromosomes and, thus, some (if not all) of
    > the chromosomes are longer than 2Gbp. Although we don't have
    > chromosome-size scaffolds yet, we are trying our best and managed to
    > assemble some very long scaffolds (with quite some gaps - N's): ~1.5Gbp.
    > Now I am running into problems with those long scaffolds, since although
    > it is perfectly possible to map the RNA/DNAseq reads to the scaffolds it
    > is not possible to sort and index the resulting BAM files, which means
    > that they cannot be viewed in the genome browser.
    > I tried "samtools index -c -m", but unsuccessfully irrespective of the
    > value specified by the "-m" option.
    > The problem seems to be the LENGTH of the scaffold. I also looked at the
    > source code (however, not deep enough, therefore, excuse me if I'm wrong)
    > and did some testing with different datasets and reference sequences and
    > have a feeling that some internal variables might "overrun" if the
    > position of a read within the scaffold exceeds ~500,000,000. Is it right?
    > If I'm right, is there any fix to that problem or is it an inherent issue
    > that cannot be fixed easily?
    >
    > I would highly appreciate any advice on how to deal with that issue.
    > Theoretically, I could split our long scaffolds into shorter pieces,
    > however, that would defeat the notion of assembling chromosome-size
    > scaffolds.
    >
    > Thank you very much in advance!
    > Best regards
    > Sergej
    >
    >
    > Dr. Sergej Nowoshilow
    > Post-doc/Bioinformatician in Tanaka Lab
    >
    > Elly Tanaka group
    > Animal models of regeneration
    > Campus-Vienna-Biocenter 1
    > 1030 Vienna
    >
    > email: sergej.nowoshi...@imp.ac.at
    > phone: +43 (0) 1 79730 3203
    >
    > This message is confidential and may contain privileged information. It
    > is intended for the named recipients only. If you receive it in error
    > please notify me and permanently delete the original message and any
    > copies.
    >
    >
