On Tue, Jan 18, 2022 at 05:52:48PM +0000, Peter Cock via Samtools-help wrote:
> My guess is a Windows vs Linux new line ending change during a file transfer
> could have thrown some things off, or a rarer network glitch is possible
> too.
> It should be easy to check and rule this out.

It shouldn't be that, as the MD5sum is calculated after whitespace
removal and uppercasing of the sequence, so line endings don't
affect it.
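
To illustrate why line endings shouldn't matter, here is a rough Python
sketch of that calculation. It assumes the rule as I understand it from
the SAM/CRAM specs (drop everything outside printable ASCII 33..126,
uppercase, then MD5), and that the input is sequence lines only with no
">name" header line. The function name cram_ref_md5 is made up for this
example.

```python
import hashlib

def cram_ref_md5(raw: str) -> str:
    """MD5 of a reference sequence after stripping and uppercasing."""
    # Keep only printable, non-space ASCII (33..126); this drops
    # spaces, tabs, CR and LF before hashing.
    cleaned = "".join(ch for ch in raw if 33 <= ord(ch) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()

# Same bases, Unix vs Windows line endings (toy data, not a real ref).
unix_text = "acgtNN\nACGT\n"
dos_text = "acgtNN\r\nACGT\r\n"
print(cram_ref_md5(unix_text) == cram_ref_md5(dos_text))  # True
```

Both strings reduce to the same cleaned sequence, so the digests agree.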

> slice: sequence id 10, start 1, span 666454, expected MD5 
> c831338ca9079d76bebd0b0a5eb102ef

This implies it's the first 666454 bp of ref number 10 (note that'll
be the 11th @SQ line, as numbering starts at zero).  You can get that
from the header.  Once you know the textual name, you should be able
to do a "samtools faidx ref.fa name:1-666454" on the reference
(substituting the real name) to pull out the associated sequence.
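
As a sketch of that lookup, the Python below maps a zero-based reference
id plus start and span onto a faidx-style region string. It assumes you
have the header as text (e.g. from "samtools view -H in.cram") and that
start is 1-based, as the error message suggests. The helper names and
the toy header are invented for illustration.

```python
def sq_names(header_text):
    """Return the SN: names of the @SQ lines, in header order."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
                    break
    return names

def slice_region(header_text, ref_id, start, span):
    """Build 'name:start-end' for a zero-based ref_id."""
    name = sq_names(header_text)[ref_id]
    end = start + span - 1  # regions are inclusive at both ends
    return f"{name}:{start}-{end}"

# Toy header with 11 @SQ lines named chr1..chr11 (hypothetical).
header = "@HD\tVN:1.6\tSO:coordinate\n" + "\n".join(
    f"@SQ\tSN:chr{i}\tLN:1000000" for i in range(1, 12)
)
print(slice_region(header, 10, 1, 666454))  # chr11:1-666454
```

The printed string is what you would pass as the region argument to
samtools faidx.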

Are there any non-ACGTN characters in it (|tr -d "ACGTN \012")?  IIRC
the CRAM spec states quite clearly how to compute the MD5, but
potentially the two implementations differ?  I'd hope not, but clearly
something does!
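
If tr isn't handy, roughly the same check can be sketched in Python.
Like the tr command above it is case-sensitive, so lowercase bases are
reported too, and it expects sequence lines only (no ">name" line).
The function name unexpected_bases is just for this example.

```python
from collections import Counter

def unexpected_bases(seq_text, allowed="ACGTN"):
    """Count characters that are not allowed bases, space or newline."""
    ignore = set(allowed) | {" ", "\n"}
    return Counter(ch for ch in seq_text if ch not in ignore)

# Toy input containing two IUPAC codes and some lowercase bases.
counts = unexpected_bases("ACGTN\nACRYT acgt\n")
print(sorted(counts.items()))
```

An empty result means the extracted region contains nothing but
A, C, G, T and N.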

The other thing to check is whether the file contains an embedded
reference, and if so whether it matches the external reference.  I
have attempted to do this myself before by using a consensus as a
reference in order to reduce the number of sequence differences being
encoded (albeit then at a cost of greater MD/NM tag size).  That
worked fine for samtools, but htsjdk wouldn't swallow it as it
validated the internal embedded sequence MD5sum against the external
reference and then threw its arms up in the air in alarm. :)

The Staden io_lib "cram_dump" tool can tell you that sort of thing,
but there's no easy way to interrogate the file to that detail within
samtools.

Io_lib also comes with a cram_filter tool that lets you select
specific slice numbers to produce a smaller CRAM.
E.g. cram_filter -n 34-36 in.cram out.cram.

That can sometimes help to reduce the data set down to something
minimal, for easy testing and distribution of problematic data sets.

James


-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA

