On Tue, 18 Jan 2022, Sabine van Schie wrote:
Hi all,
I used samtools index to index cram files and tried to view the data in the
Integrated Genomic Viewer (https://software.broadinstitute.org/software/igv/ ).
The samples are whole-genome sequencing data from S. cerevisiae.
We have downloaded the cram files from our collaborator and I have simply
used the default samtools index command to index the cram file. I did not
get any errors at this stage.
For some files I'm having problems viewing certain parts of the genome.
However, for other files (from the same batch and using the same ref genome
and alignment method), I can view the exact same region just fine. The
error in IGV is as follows:
Error - possible sequence mismatch (wrong reference for this file):
htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for
slice: sequence id 10, start 1, span 666454, expected MD5
c831338ca9079d76bebd0b0a5eb102ef
I have attached one cram index file that gives this error when looking at
e.g. YKR002W on Chr XI. I haven't attached the corresponding cram file
because it is rather large (200 MB), but if needed I can try to send this
or other info that is missing.
Luckily, I recognised the name in your attached index file. The original
CRAM file came from here, and I have a copy of it. Your index matches the
one I have (once they've been uncompressed), so it's very likely the same
CRAM file as well.
You could try running a quick checksum over your copy to see if it matches
mine:
# md5sum 42363_4#33.cram
c285d1a69b3fca6c7383154c690901b5 42363_4#33.cram
Looking for Chr XI in the header gives this:
# samtools view -H 42363_4#33.cram | grep -w SN:chrXI
@SQ SN:chrXI LN:666454
UR:/lustre/scratch120/npg_repository/references/Saccharomyces_cerevisiae/S288c/all/fasta/S_cerevisiae_S288c.fa
AS:S288 M5:c831338ca9079d76bebd0b0a5eb102ef SP:Saccharomyces cerevisiae
So the M5: field matches the one in your error report. It is also known
to the EBI's reference sequence server:
# curl https://www.ebi.ac.uk/ena/cram/md5/c831338ca9079d76bebd0b0a5eb102ef |
md5sum
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 650k 100 650k 0 0 2892k 0 --:--:-- --:--:-- --:--:-- 2892k
c831338ca9079d76bebd0b0a5eb102ef -
I can read all of chrXI in my copy which suggests that my copy of the CRAM
file is OK:
# samtools view 42363_4#33.cram chrXI | wc -l
24035
After a bit of hunting around (helped by my having the original fasta file
used for the alignment) I managed to track chrXI down to this genbank
entry:
https://www.ncbi.nlm.nih.gov/nuccore/gi%7C83722562
Note that this has since been updated, and the new version is a bit
longer. Are you using the newer reference in IGV? If so, it may explain
the problem. I'm not sure how IGV gets the reference sequence it needs to
decompress the CRAM file, but if it uses the one you gave it and that
doesn't match the one used for compression then you will see an error like
the one you got.
I can send you a copy of the reference used to align the data in the CRAM
file separately. You can try it in IGV to see if it fixes the problem.
Since I'm new to the cram format, and samtools in general, I'm not sure
whether the issue arises during the indexing with samtool and how I could
check this.
I would be very happy with any troubleshooting tips!
I'm using: samtools 1.14 Using htslib 1.14.
Rob Davies r...@sanger.ac.uk
The Sanger Institute http://www.sanger.ac.uk/
Hinxton, Cambs.,
CB10 1SA, U.K.
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help