Dear Robert,

What a coincidence you saw my message and recognized the file, thank you
for your time.

The md5 sum and the ChrXI header match with what you obtained using the
original file.
With the reference genome used for alignment the error disappears (also for
other files), so the *problem is solved!*
I was afraid the file perhaps got corrupted during the download/index
process, but luckily this seems not to be the case.

To answer to your comment about the IGV version: I had also tried the newer
version of IGV with preinstalled SacCer3, but that gave an error already at
the loading stage.. Using the older version with SacCer2 installed, I could
at least upload the files and therefore thought this might be the correct
ref genome.. but clearly there were also some differences here.

In order to be more independent next time (and maybe this is useful for
other people in this mailing list as well), is there a way to obtain the
original reference genome based on information encoded in the cram file?
I spent some time looking into this, and had scanned the full cram file
headers to see if I could find anything useful, but couldn't figure it out.
This:
UR:/lustre/scratch120/npg_repository/references/Saccharomyces_cerevisiae/S288c/all/fasta/S_cerevisiae_S288c.fa
doesn't
lead me anywhere.  I can indeed retrieve the sequence for chr XI at
https://www.ebi.ac.uk/ena/cram/ using the M5 ID
(M5:c831338ca9079d76bebd0b0a5eb102ef), but when I try the Md5 checksum
(i.e. c285d1a69b3fca6c7383154c690901b5) I get an error.
Is one supposed to download all the chromosomes separately and then
concatenate?

I final question, which you may ignore if you don't have it, do you have an
annotation file that can be used with this particular reference genome
version?  I can probably download it from SGD (
http://sgd-archive.yeastgenome.org/sequence/S288C_reference/), but if you
have a matching annotation file quickly accessible this might be useful.

Thank you again for your help and have a nice day.

Best,

Sabine




Op wo 19 jan. 2022 om 01:08 schreef Robert Davies <r...@sanger.ac.uk>:

> On Tue, 18 Jan 2022, Sabine van Schie wrote:
>
> > Hi all,
> >
> > I used samtools index to index cram files and tried to view the data in
> the
> > Integrated Genomic Viewer (
> https://software.broadinstitute.org/software/igv/ ).
> > The samples are whole-genome sequencing data from S. cerevisiae.
> >
> > We have downloaded the cram files from our collaborator and I have simply
> > used the default samtools index command to index the cram file. I did not
> > get any errors at this stage.
> >
> > For some files I'm having problems viewing certain parts of the genome.
> > However, for other files (from the same batch and using the same ref
> genome
> > and alignment method), I can view the exact same region just fine. The
> > error in IGV is as follows:
> >
> > Error - possible sequence mismatch (wrong reference for this file):
> > htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for
> > slice: sequence id 10, start 1, span 666454, expected MD5
> > c831338ca9079d76bebd0b0a5eb102ef
> >
> > I have attached one cram index file that gives this error when looking at
> > e.g. YKR002W on Chr XI. I haven't attached the corresponding cram file
> > because it is rather large (200 MB), but if needed I can try to send this
> > or other info that is missing.
>
> Luckily, I recognised the name in your attached index file.  The original
> CRAM file came from here, and I have a copy of it.  Your index matches the
> one I have (once they've been uncompressed), so it's very likely the same
> CRAM file as well.
>
> You could try running a quick checksum over your copy to see if it matches
> mine:
>
> # md5sum 42363_4#33.cram
> c285d1a69b3fca6c7383154c690901b5  42363_4#33.cram
>
> Looking for Chr XI in the header gives this:
>
> # samtools view -H 42363_4#33.cram | grep -w SN:chrXI
> @SQ     SN:chrXI        LN:666454
>  
> UR:/lustre/scratch120/npg_repository/references/Saccharomyces_cerevisiae/S288c/all/fasta/S_cerevisiae_S288c.fa
> AS:S288     M5:c831338ca9079d76bebd0b0a5eb102ef SP:Saccharomyces cerevisiae
>
> So the M5: field matches the one in your error report.  It is also known
> to the EBI's reference sequence server:
>
> # curl https://www.ebi.ac.uk/ena/cram/md5/c831338ca9079d76bebd0b0a5eb102ef
> | md5sum
>    % Total    % Received % Xferd  Average Speed   Time    Time     Time
>  Current
>                                   Dload  Upload   Total   Spent    Left
>  Speed
> 100  650k  100  650k    0     0  2892k      0 --:--:-- --:--:-- --:--:--
> 2892k
> c831338ca9079d76bebd0b0a5eb102ef  -
>
>
> I can read all of chrXI in my copy which suggests that my copy of the CRAM
> file is OK:
>
> # samtools view 42363_4#33.cram chrXI | wc -l
> 24035
>
> After a bit of hunting around (helped by my having the original fasta file
> used for the alignment) I managed to track chrXI down to this genbank
> entry:
>
> https://www.ncbi.nlm.nih.gov/nuccore/gi%7C83722562
>
> Note that this has since been updated, and the new version is a bit
> longer.  Are you using the newer reference in IGV?  If so, it may explain
> the problem.  I'm not sure how IGV gets the reference sequence it needs to
> decompress the CRAM file, but if it uses the one you gave it and that
> doesn't match the one used for compression then you will see an error like
> the one you got.
>
> I can send you a copy of the reference used to align the data in the CRAM
> file separately.  You can try it in IGV to see if it fixes the problem.
>
> > Since I'm new to the cram format, and samtools in general, I'm not sure
> > whether the issue arises during the indexing with samtool and how I could
> > check this.
> >
> > I would be very happy with any troubleshooting tips!
> > I'm using: samtools 1.14 Using htslib 1.14.
>
> Rob Davies              r...@sanger.ac.uk
> The Sanger Institute    http://www.sanger.ac.uk/
> Hinxton, Cambs.,
> CB10 1SA, U.K.
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to