On Tue, 18 Jan 2022, Sabine van Schie wrote:

Hi all,

I used samtools index to index cram files and tried to view the data in the
Integrated Genomic Viewer (https://software.broadinstitute.org/software/igv/ ).
The samples are whole-genome sequencing data from S. cerevisiae.

We have downloaded the cram files from our collaborator and I have simply
used the default samtools index command to index the cram file. I did not
get any errors at this stage.

For some files I'm having problems viewing certain parts of the genome.
However, for other files (from the same batch and using the same ref genome
and alignment method), I can view the exact same region just fine. The
error in IGV is as follows:

Error - possible sequence mismatch (wrong reference for this file):
htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for
slice: sequence id 10, start 1, span 666454, expected MD5
c831338ca9079d76bebd0b0a5eb102ef

I have attached one cram index file that gives this error when looking at
e.g. YKR002W on Chr XI. I haven't attached the corresponding cram file
because it is rather large (200 MB), but if needed I can try to send this
or other info that is missing.

Luckily, I recognised the name in your attached index file. The original CRAM file came from here, and I have a copy of it. Your index matches the one I have (once they've been uncompressed), so it's very likely the same CRAM file as well.

You could try running a quick checksum over your copy to see if it matches mine:

# md5sum 42363_4#33.cram
c285d1a69b3fca6c7383154c690901b5  42363_4#33.cram

Looking for Chr XI in the header gives this:

# samtools view -H 42363_4#33.cram | grep -w SN:chrXI
@SQ     SN:chrXI        LN:666454       
UR:/lustre/scratch120/npg_repository/references/Saccharomyces_cerevisiae/S288c/all/fasta/S_cerevisiae_S288c.fa
  AS:S288     M5:c831338ca9079d76bebd0b0a5eb102ef SP:Saccharomyces cerevisiae

So the M5: field matches the one in your error report. It is also known to the EBI's reference sequence server:

# curl https://www.ebi.ac.uk/ena/cram/md5/c831338ca9079d76bebd0b0a5eb102ef | 
md5sum
  % Total    % Received % Xferd  Average Speed   Time    Time     Time   Current
                                 Dload  Upload   Total   Spent    Left   Speed
100  650k  100  650k    0     0  2892k      0 --:--:-- --:--:-- --:--:-- 2892k
c831338ca9079d76bebd0b0a5eb102ef  -


I can read all of chrXI in my copy which suggests that my copy of the CRAM file is OK:

# samtools view 42363_4#33.cram chrXI | wc -l
24035

After a bit of hunting around (helped by my having the original fasta file used for the alignment) I managed to track chrXI down to this genbank entry:

https://www.ncbi.nlm.nih.gov/nuccore/gi%7C83722562

Note that this has since been updated, and the new version is a bit longer. Are you using the newer reference in IGV? If so, it may explain the problem. I'm not sure how IGV gets the reference sequence it needs to decompress the CRAM file, but if it uses the one you gave it and that doesn't match the one used for compression then you will see an error like the one you got.

I can send you a copy of the reference used to align the data in the CRAM file separately. You can try it in IGV to see if it fixes the problem.

Since I'm new to the cram format, and samtools in general, I'm not sure
whether the issue arises during the indexing with samtool and how I could
check this.

I would be very happy with any troubleshooting tips!
I'm using: samtools 1.14 Using htslib 1.14.

Rob Davies              r...@sanger.ac.uk
The Sanger Institute    http://www.sanger.ac.uk/
Hinxton, Cambs.,
CB10 1SA, U.K.


--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to