On Tue, Jan 18, 2022 at 05:52:48PM +0000, Peter Cock via Samtools-help wrote:
> My guess is a Windows vs Linux new line ending change during a file transfer
> could have thrown some things off, or a rarer network glitch is possible
> too.
> It should be easy to check and rule this out.

It shouldn't be that as the MD5sum is calculated after white-space removal
and uppercasing the sequence.

> slice: sequence id 10, start 1, span 666454, expected MD5
> c831338ca9079d76bebd0b0a5eb102ef

This implies it's the first 666454 bp of ref number 10 (note that'll be the
11th @SQ line as it starts at zero).  You can get that from the header.
Once you know the textual name, you should be able to do a
"samtools faidx ref.fa region" on the reference to pull out the associated
sequence.  Are there any non ACGTN characters in it (|tr -d "ACGTN \012")?

IIRC the CRAM spec states quite clearly how to compute the MD5, but
potentially the two implementations differ?  I'd hope not, but clearly
something does!

The other thing to check is whether the file contains an embedded
reference, and if so whether it matches the external reference.  I have
attempted to do this myself before by using a consensus as a reference in
order to reduce the number of sequence differences being encoded (albeit
then at a cost of greater MD/NM tag size).  That worked fine for samtools,
but htsjdk wouldn't swallow it as it validated the internal embedded
sequence MD5sum against the external reference and then threw its arms up
in the air in alarm. :)

The Staden io_lib "cram_dump" tool can tell you that sort of thing, but
there's no easy way to interrogate the file to that detail within samtools.

Io_lib also comes with a cram_filter tool that lets you select specific
slice numbers to produce a smaller CRAM.  Eg

    cram_filter -n 34-36 in.cram out.cram

That can sometimes help to reduce the data set down to something minimal,
for easy testing and distribution of problematic data sets.

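As a rough sketch of the reference-side check described above, the Python
below recomputes an MD5 for a slice region and counts any characters other
than ACGTN.  It is an illustration only, not what samtools or htsjdk
actually run.  The normalisation rule (drop everything outside printable
non-space ASCII, then uppercase) is my reading of the M5 description in the
SAM spec, and the function names and the tiny in-memory FASTA are made up
for the example.

```python
import hashlib
from collections import Counter


def read_fasta_record(lines, name):
    """Return the raw sequence text of the FASTA record called `name`.

    `lines` is any iterable of text lines (e.g. an open file).
    Returns an empty string if no record with that name is found.
    """
    chunks, wanted = [], False
    for line in lines:
        if line.startswith(">"):
            if wanted:
                break  # reached the record after the one we wanted
            fields = line[1:].split()
            wanted = bool(fields) and fields[0] == name
        elif wanted:
            chunks.append(line)
    return "".join(chunks)


def normalise(seq):
    """Assumed M5-style normalisation: keep ASCII 33..126, then uppercase."""
    return "".join(c for c in seq if 33 <= ord(c) <= 126).upper()


def slice_md5(seq, start, span):
    """MD5 hex digest of `span` bases beginning at 1-based position `start`."""
    norm = normalise(seq)  # strip whitespace first so coordinates are in bases
    region = norm[start - 1:start - 1 + span]
    return hashlib.md5(region.encode("ascii")).hexdigest()


def unexpected_bases(seq):
    """Count characters other than A, C, G, T and N after normalisation."""
    return Counter(c for c in normalise(seq) if c not in "ACGTN")


if __name__ == "__main__":
    # Hypothetical toy reference; a real check would read ref.fa instead.
    fasta = [">chr1 toy example\n", "acgtNN\n", "ryACGT\n", ">chr2\n", "GGGG\n"]
    raw = read_fasta_record(fasta, "chr1")
    print("md5 of first 6 bases:", slice_md5(raw, 1, 6))
    print("non-ACGTN characters:", dict(unexpected_bases(raw)))
```

For the failing slice you would call slice_md5(raw, 1, 666454) on the
sequence named by the 11th @SQ line and compare the result with the
"expected MD5" in the error message.
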
James

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA