On Fri, Aug 3, 2018 at 2:30 AM James Bonfield <j...@sanger.ac.uk> wrote:

> On Thu, Aug 02, 2018 at 02:44:39PM -0400, Thomas W. Blackwell wrote:
> > The problematic feature is this:  if the header of the existing .bam
> > file already contains md5 checksums in the "@SQ" header lines for
> > the individual hg19 sequence contigs used in read mapping, then
> > samtools will copy those md5 checksums into the header of the .cram
> > file, without checking whether they match the contig checksums of
> > the genome reference now being used for compression.  No errors or
> > error messages are generated at this point. But when you try to read
> > the resulting .cram, samtools uses the reference originally used for
> > mapping, not the one used for compression, and the .cram file is
> > unreadable ... until you re-write the .cram header with the contig
> > md5 checksums of the reference sequence actually used for
> > compression.
>
> It was a concious decision to not check a manually specified reference
> matches the M5 header tags because it incurs a *significant* cpu cost,
> especially when handling small portions of a data set and people
> (almost) universally specify the same reference.
>
> However perhaps it should be something we can add using e.g.
> "--input-fmt-option recompute_md5".  Any thoughts on this?
>
> > So your compression procedure will need to be:  first write the
> > .cram using "samtools view", then rewrite the header on the new
> > .cram using "samtools reheader".  An alternative is to rewrite the
> > .bam file first, leaving the contig names but omitting their
> > checksums.  If there are no checksums already, then samtools will
> > automatically put the correct checksums into the .cram file header.
> > (However, the first way, fixing the .cram header, is faster than
> > fixing the .bam header.)
>
> This also leads to broken data if the new vs old references need
> indels to align to each other.  Your CRAM will decode and give the
> same data back again, but all variant calling is completely wrong as
> reads are aligned in different locations.  I think the only case it
> doesn't break horribly is masking (as mentioned).
>

That's a good point.  I imagine we'd always have to realign the sequence
data to the latest reference before calling again (which might even be h38
or other major revision).  But that would still preserve the integrity of
the sequencing data; inability to decompress the CRAM file--data loss--is
much scarier though.

CRAM was never really designed to change references underneath it and
> has implicit assumptions that it's using the aligned reference to
> encode against, as this is what leads to best compression.  I don't
> understand the benefit of specifying another reference.  What is the
> problem you are attempting to fix here?
>
> If you want to cope with the reference being deleted, I'd suggest
> using the embedded reference feature instead which copies the relevant
> parts of the reference into each cram slice.
>

Yes, it's coping with the reference revision being lost and avoiding the
complexity of permanently tracking reference revisions.  We tried
embed_ref, but that makes the CRAM file bigger than the BAM (this is for
target enrichment data).

Sounds like there are enough concerns to avoid this approach though, thanks
Tom and James for the feedback!

Ziga
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to