I am looking at implementing a lossy read-name compressor for CRAM in Samtools / Scramble. One flaw in this is that we don't know how many records may share a given read name. If I have a function that returns the expected number of times a read name occurs in the entire file, then I can count how many times it occurs within this CRAM slice and safely throw away the read names if the two match, otherwise keep them. It seems simple, but how should the function returning the expected read-name count be defined?
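To make the slice-level check concrete, here is a rough Python sketch of the comparison described above. It is only an illustration, not Samtools or Scramble code: the `(qname, flag)` tuples and the `names_safe_to_drop` and `expected_count` names are hypothetical stand-ins, and the expected-count function is passed in as a callback because defining it is the open question.

```python
def names_safe_to_drop(slice_records, expected_count):
    """Return the set of read names that could be discarded in one slice.

    slice_records:  iterable of (qname, flag) tuples, a hypothetical minimal
                    stand-in for the records held in one CRAM slice.
    expected_count: callable that takes the list of records sharing a name
                    and returns how many records with that name are expected
                    in the whole file.
    """
    # Group the slice's records by read name.
    by_name = {}
    for rec in slice_records:
        by_name.setdefault(rec[0], []).append(rec)

    # A name is droppable only if every expected record is in this slice.
    droppable = set()
    for name, recs in by_name.items():
        if len(recs) == expected_count(recs):
            droppable.add(name)
    return droppable
```

Any name whose observed count differs from the expected count is kept, so an over-estimate costs compression but never loses a pairing.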
My thought process so far is:

- If FLAG&0x1 is 0 then the expected count is 1, subject to subsequent rules.
- If FLAG&0x1 is 1 then the expected count is 2, subject to subsequent rules.
- If aux TC is defined then it specifies the expected count. (Does it?)
- If aux SA is defined then it indicates the number of additional reads *for this end only*.
- Perhaps also check RNEXT/PNEXT combinations; if any exist outside of the observed slice of records for a template, irrespective of the claimed flags, TC or SA, then set the expected count to INT_MAX (force all names to be kept for this template).

TC is worded as "the number of segments in the template". Is it fair to assume that this is the same as the total number of alignment records produced by the aligner, regardless of whether it emitted secondary and/or supplementary alignment records?[1] It may be overly large in the case of a region query pulling back a sub-section, but that's not a bad failure mode compared to being too small.

SA is tricky as it only indicates how many additional reads exist for this end, so records with the BAM_FREAD1 flag have one SA list, records with the BAM_FREAD2 flag have another SA list, and I assume there exists another SA list for other combinations (both or neither of the 1/2 flags set)?

Where do secondary alignments fit in? SA strictly refers to a chimeric alignment, i.e. a single read that has been split into multiple parts during alignment. If there are secondary alignments, is it correct to assume that each may also have its own SA list? E.g. a single-ended template that can map in two places, where each place is a chimeric alignment with 3 segments, would give 6 SAM records in total, each of which has 2 other reads listed in its SA tag?

Is there any way at all to tell that a secondary alignment exists from another SAM entry? If not, I think the whole thing is impossible without first sorting to name order, adding a tag (maybe TC as mentioned above) and sorting back again. Very ugly and slow!

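For discussion, the rules above could be sketched in Python roughly as follows. This is a guess at one possible shape, not a working answer to the questions: `Rec` is a hypothetical minimal record (not an htslib structure), `KEEP_ALL` stands in for INT_MAX, trusting TC is off by default because its meaning is exactly what is in doubt, and returning `KEEP_ALL` as soon as any secondary record is seen is my own conservative assumption. Secondaries that are never observed in the slice remain undetectable here.

```python
from dataclasses import dataclass, field

KEEP_ALL = 2**31 - 1  # stand-in for INT_MAX: never equals an observed count

FPAIRED, FREAD1, FREAD2, FSECONDARY = 0x1, 0x40, 0x80, 0x100


@dataclass
class Rec:
    """Hypothetical minimal SAM record used only for this sketch."""
    flag: int
    rname: str = "*"
    pos: int = 0
    rnext: str = "*"
    pnext: int = 0
    tags: dict = field(default_factory=dict)


def expected_count(recs, trust_tc=False):
    """Guess how many records share this read name in the whole file."""
    # Assumption: once a secondary is seen, others cannot be ruled out.
    if any(r.flag & FSECONDARY for r in recs):
        return KEEP_ALL

    # RNEXT/PNEXT check: a mate position not observed in this slice
    # forces the name to be kept, whatever the flags or tags claim.
    seen = {(r.rname, r.pos) for r in recs}
    for r in recs:
        if r.flag & FPAIRED and r.rnext != "*":
            mate_ref = r.rname if r.rnext == "=" else r.rnext
            if (mate_ref, r.pnext) not in seen:
                return KEEP_ALL

    # Optional: treat TC as the total record count (unconfirmed meaning).
    if trust_tc:
        for r in recs:
            if "TC" in r.tags:
                return int(r.tags["TC"])

    # Base count of 1 (unpaired) or 2 (paired), raised per end by SA.
    paired = any(r.flag & FPAIRED for r in recs)
    per_end = {end: 1 for end in ((FREAD1, FREAD2) if paired else (0,))}
    for r in recs:
        end = r.flag & (FREAD1 | FREAD2)
        n_sa = len([s for s in r.tags.get("SA", "").split(";") if s])
        per_end[end] = max(per_end.get(end, 1), 1 + n_sa)
    return sum(per_end.values())
```

Each end is keyed by its READ1/READ2 flag bits, so records with both or neither bit set get their own SA tally rather than being merged into read 1 or read 2.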
James

[1] Right now I don't know of any aligners that output this, although it would be very useful for read-name collation algorithms. It's something I'm considering adding to samtools fixmate.

-- 
James Bonfield (j...@sanger.ac.uk)
A Staden Package developer: https://sf.net/projects/staden/