On Wed, 19 Jan 2022, Sabine van Schie wrote:

Dear Robert,

What a coincidence you saw my message and recognized the file, thank you
for your time.

The md5 sum and the ChrXI header match with what you obtained using the
original file.
With the reference genome used for alignment the error disappears (also for
other files), so the *problem is solved!*
I was afraid the file perhaps got corrupted during the download/index
process, but luckily this seems not to be the case.

To answer to your comment about the IGV version: I had also tried the newer
version of IGV with preinstalled SacCer3, but that gave an error already at
the loading stage.. Using the older version with SacCer2 installed, I could
at least upload the files and therefore thought this might be the correct
ref genome.. but clearly there were also some differences here.

Yes, IGV's error reporting could be a bit better. Where alignments have the @SQ M5: tags in the header, it should be able to compare with the reference you supplied and quickly tell you you're using the wrong one.

In order to be more independent next time (and maybe this is useful for
other people in this mailing list as well), is there a way to obtain the
original reference genome based on information encoded in the cram file?
I spent some time looking into this, and had scanned the full cram file
headers to see if I could find anything useful, but couldn't figure it out.
This:
UR:/lustre/scratch120/npg_repository/references/Saccharomyces_cerevisiae/S288c/all/fasta/S_cerevisiae_S288c.fa
doesn't
lead me anywhere.  I can indeed retrieve the sequence for chr XI at
https://www.ebi.ac.uk/ena/cram/  using the M5 ID
(M5:c831338ca9079d76bebd0b0a5eb102ef), but when I try the Md5 checksum
(i.e. c285d1a69b3fca6c7383154c690901b5) I get an error.
Is one supposed to download all the chromosomes separately and then
concatenate?

It is possible to do this, by scanning the CRAM header for the names and checksums, downloading the sequences and reformatting a bit. For example, this rather hacky 1-liner does the trick:

samtools view -H 42363_4#33.cram | perl -ne 'if (/^\@SQ/) { ($n) = /\tSN:(\S+)/; ($m) = /\tM5:([0-9a-f]{32})/; if ($n 
&& $m) { print ">$n\n"; system("curl https://www.ebi.ac.uk/ena/cram/md5/$m | fold -w 70"); 
print "\n"; } }' > /tmp/S_cerevisiae_S288c.fa

I final question, which you may ignore if you don't have it, do you have an
annotation file that can be used with this particular reference genome
version?  I can probably download it from SGD (
http://sgd-archive.yeastgenome.org/sequence/S288C_reference/ ), but if you
have a matching annotation file quickly accessible this might be useful.

I'm afraid I don't. The SGD site looks like it has a good collection of historic S288C assemblies, so hopefully you can find what you need there. If all else fails, you can always dump out the original sequences from the CRAM file and realign them to an up-to-date reference as another way to solve the problem.

It is also possible to get information on some of the references by pasting the M5 checksum into the ENA's text search. For example, the one for chrMT:

https://www.ebi.ac.uk/ena/browser/text-search?query=71c39cf065b8d574f636b654c274cf1b

finds (if you click "show all results") 
https://www.ebi.ac.uk/ena/browser/view/KP263414

Unfortunately it doesn't work for all of them, possibly because the ENA haven't checksummed old versions of sequences. I'll try to remember to ask them next time I talk to them, to see if that is the case.

Thank you again for your help and have a nice day.

Thanks, glad I was able to help,

Rob Davies              r...@sanger.ac.uk
The Sanger Institute    http://www.sanger.ac.uk/
Hinxton, Cambs.,
CB10 1SA, U.K.


--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to