Ziga  -

In my experience, the .cram format is engineered to support compressing with a reference that differs from the one used for alignment, but the samtools software has one feature which causes all kinds of trouble with it.

The problematic feature is this: if the header of the existing .bam file already contains md5 checksums (the "M5" field) in the "@SQ" header lines for the individual hg19 contigs used in read mapping, then samtools copies those checksums into the header of the .cram file without checking whether they match the contigs of the reference now being used for compression. No error or warning is generated at this point. But when you try to read the resulting .cram, samtools looks for the reference identified by those header checksums, i.e. the one originally used for mapping, not the one used for compression, and the .cram file is unreadable ... until you rewrite the .cram header with the contig md5 checksums of the reference sequence actually used for compression.
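
To see in advance which contigs would be affected, something like the following rough Python sketch could compare the M5 values in a header against checksums computed from the compression reference. It assumes the header text has already been obtained (for example from "samtools view -H") and that the reference contigs are already in memory as plain strings; the function names are made up for illustration and are not part of samtools. The checksum follows my reading of the SAM specification: drop characters outside the printable range 33-126, uppercase the rest, then take the md5.

```python
import hashlib


def contig_md5(sequence):
    """md5 of a contig as used for the @SQ M5 field (per the SAM spec):
    keep only characters in the range 33..126, uppercase, then hash."""
    cleaned = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def parse_sq_md5(header_text):
    """Return {contig name: M5 checksum} for @SQ lines that carry an M5 field."""
    checksums = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ\t"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "M5" in fields:
            checksums[fields["SN"]] = fields["M5"].lower()
    return checksums


def mismatched_contigs(header_text, reference):
    """List contigs whose header M5 differs from the given reference.

    `reference` is a hypothetical in-memory dict {contig name: sequence}."""
    header_md5 = parse_sq_md5(header_text)
    return sorted(
        name
        for name, seq in reference.items()
        if name in header_md5 and header_md5[name] != contig_md5(seq)
    )
```

Any contig this reports is one whose checksum in the header no longer describes the reference you intend to compress against.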

So your compression procedure will need to be: first write the .cram using "samtools view", then rewrite the header of the new .cram using "samtools reheader". An alternative is to rewrite the .bam header first, keeping the contig names but omitting their checksums. If no checksums are present, samtools will automatically put the correct checksums into the .cram file header. (However, the first way, fixing the .cram header, is faster than fixing the .bam header.)
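
For the alternative route, the header edit itself is simple text processing. A minimal Python sketch, assuming you have dumped the header to text (e.g. with "samtools view -H") and will apply the result yourself with "samtools reheader"; the function name strip_sq_md5 is only illustrative:

```python
def strip_sq_md5(header_text):
    """Remove the M5 field from every @SQ line, leaving all other fields
    (SN, LN, UR, ...) and all other header lines untouched."""
    out = []
    for line in header_text.splitlines():
        if line.startswith("@SQ\t"):
            fields = line.split("\t")
            line = "\t".join(f for f in fields if not f.startswith("M5:"))
        out.append(line)
    # Re-terminate every line, since SAM headers are newline-delimited.
    return "".join(line + "\n" for line in out)
```

Only the M5 fields are dropped here; contig names and lengths stay exactly as they were, so the alignments themselves are not affected.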

                                                         -  tom blackwell  -

On Thu, 2 Aug 2018, Ziga Mahkovec wrote:

Question about BAM->CRAM compression and reference sequence files: does the
reference used for CRAM compression/decompression have to be identical to
the one used for aligning the BAM file?

In our pipeline, we periodically create minor revisions of the reference,
e.g. masking a pseudogene region in hg19.  We're now considering
re-compressing BAM files aligned with those references to CRAM files (for
long-term storage), but for data durability reasons we'd prefer to use an
hg19 reference from the EBI CRAM reference registry or similar; that way,
we don't have to rely on versioning and long-term storage of all the
reference revisions.

Would this work?  I'm less worried about compression efficiency (which I
assume would not be affected as long as the reference revisions are small)
and more about data integrity and durability.

Thanks,
Ziga


_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help