On Thu, Aug 02, 2018 at 02:44:39PM -0400, Thomas W. Blackwell wrote:
> The problematic feature is this:  if the header of the existing .bam
> file already contains md5 checksums in the "@SQ" header lines for
> the individual hg19 sequence contigs used in read mapping, then
> samtools will copy those md5 checksums into the header of the .cram
> file, without checking whether they match the contig checksums of
> the genome reference now being used for compression.  No errors or
> error messages are generated at this point. But when you try to read
> the resulting .cram, samtools uses the reference originally used for
> mapping, not the one used for compression, and the .cram file is
> unreadable ... until you re-write the .cram header with the contig
> md5 checksums of the reference sequence actually used for
> compression.

It was a concious decision to not check a manually specified reference
matches the M5 header tags because it incurs a *significant* cpu cost,
especially when handling small portions of a data set and people
(almost) universally specify the same reference.

However perhaps it should be something we can add using e.g.
"--input-fmt-option recompute_md5".  Any thoughts on this?

> So your compression procedure will need to be:  first write the
> .cram using "samtools view", then rewrite the header on the new
> .cram using "samtools reheader".  An alternative is to rewrite the
> .bam file first, leaving the contig names but omitting their
> checksums.  If there are no checksums already, then samtools will
> automatically put the correct checksums into the .cram file header.
> (However, the first way, fixing the .cram header, is faster than
> fixing the .bam header.)

This also leads to broken data if the new vs old references need
indels to align to each other.  Your CRAM will decode and give the
same data back again, but all variant calling is completely wrong as
reads are aligned in different locations.  I think the only case it
doesn't break horribly is masking (as mentioned).

CRAM was never really designed to change references underneath it and
has implicit assumptions that it's using the aligned reference to
encode against, as this is what leads to best compression.  I don't
understand the benefit of specifying another reference.  What is the
problem you are attempting to fix here?

If you want to cope with the reference being deleted, I'd suggest
using the embedded reference feature instead which copies the relevant
parts of the reference into each cram slice.

James

-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to