I am looking at implementing a lossy read-name compressor for CRAM in
Samtools / Scramble.  The main obstacle is not knowing how many
records share a given read name.  If I had a function returning the
expected number of times a read name occurs in the entire file, I
could count how many times it occurs within this CRAM slice and safely
throw away the read names if the two match, otherwise keep them.  It
seems simple, but how should I define the function that returns the
expected read-name count?
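
To make the idea concrete, here is a minimal sketch of the slice-level
decision.  The function name and the `expected_count` callback are
hypothetical; how to define that callback is exactly the open question.

```python
from collections import Counter

def droppable_names(slice_names, expected_count):
    """Return the read names that could be discarded for one slice.

    slice_names:    read names of every record in the slice (with repeats).
    expected_count: hypothetical callback giving the number of records
                    expected for a name across the whole file.
    """
    observed = Counter(slice_names)
    # A name is only safe to drop when every record carrying it is here.
    return {name for name, n in observed.items() if n == expected_count(name)}
```

Any name whose observed count differs from the expected count is kept
verbatim, so overestimating the expected count costs compression but
never loses information.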

My thought process so far is:

If FLAG&0x1 is 0 then the expected count is 1 subject to subsequent rules
If FLAG&0x1 is 1 then the expected count is 2 subject to subsequent rules

If aux TC is defined then it specifies the expected count. (Does it?)
If aux SA is defined then it indicates the number of additional reads
   *for this end only*. 

Perhaps also check RNEXT/PNEXT combinations; if any exist outside of
the observed slice of records for a template, irrespective of the
claimed flags, TC or SA, then set expected count to INT_MAX (force
all names to be kept for this template).
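
The rules so far could be sketched as follows.  This is only my
tentative reading, not a confirmed algorithm: records are plain dicts
with hypothetical keys, TC is assumed to mean the total record count,
the SA lists are summed once per end, and "outside the slice" is
simplified to a coordinate-range test on RNEXT/PNEXT.

```python
import sys

PAIRED = 0x1            # SAM FLAG bit: template has multiple segments
READ1, READ2 = 0x40, 0x80
KEEP_NAME = sys.maxsize  # stand-in for INT_MAX: always keep the name

def sa_count(sa_value):
    """Count entries in an SA value ("rname,pos,strand,CIGAR,mapQ,NM;...")."""
    return len([entry for entry in sa_value.split(";") if entry])

def expected_count(records, slice_ref, slice_start, slice_end):
    """Expected number of records for one template seen in a slice."""
    # Rule 4 first: a mate position outside the slice overrides everything.
    for r in records:
        rnext = r["rname"] if r["rnext"] == "=" else r["rnext"]
        inside = rnext == slice_ref and slice_start <= r["pnext"] <= slice_end
        if rnext != "*" and not inside:
            return KEEP_NAME
    # Rule 2 (assumption): TC, if present, is the total record count.
    for r in records:
        if "TC" in r["tags"]:
            return r["tags"]["TC"]
    # Rule 1: baseline from the paired flag.
    count = 2 if records[0]["flag"] & PAIRED else 1
    # Rule 3: SA lists additional records for one end only, so sum per end.
    per_end = {}
    for r in records:
        if "SA" in r["tags"]:
            end = r["flag"] & (READ1 | READ2)
            per_end[end] = max(per_end.get(end, 0), sa_count(r["tags"]["SA"]))
    return count + sum(per_end.values())
```

Note this says nothing about secondary alignments, which is the gap
discussed below.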

TC is worded as "the number of segments in the template".  Is it fair
to assume that this is the same as the total number of alignment
records produced by the aligner, regardless of whether it emitted
secondary and/or supplementary alignment records?[1]  It may be overly
large in case of a region query pulling back a sub-section, but that's
not a bad failure mode compared to being too small.

SA is tricky as it only indicates how many additional reads exist for
this end, so records with the BAM_FREAD1 flag share one SA list,
records with the BAM_FREAD2 flag share another, and I assume there is
yet another SA list for the other combinations (both or neither of the
1/2 flags set)?

Where do secondary alignments fit in?  SA strictly refers to a
chimeric alignment, i.e. a single read that has been split into
multiple parts during alignment.  If there are secondary alignments,
is it correct to assume that each may have its own SA list too?  E.g.
a single-ended template can map in two places and each place is a
chimeric alignment with 3 segments, giving 6 SAM records in total,
each of which lists 2 other records in its SA tag?
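
The arithmetic of that hypothetical case can be sketched like this.
The coordinates and CIGARs are invented for illustration and are not
real aligner output; the point is only that an SA-based estimate sees
3 records where 6 exist.

```python
def sa_entries(sa_value):
    """Split an SA value ("rname,pos,strand,CIGAR,mapQ,NM;...") into entries."""
    return [entry for entry in sa_value.split(";") if entry]

def naive_expected_from_sa(sa_value):
    """This record plus the records its own SA tag lists."""
    return 1 + len(sa_entries(sa_value))

# Hypothetical single-ended template: two placements, each split 3 ways.
locus_a = ["chr1,100,+,30M70S,60,0", "chr1,900,+,30S40M30S,60,0",
           "chr1,1800,+,70S30M,60,0"]
locus_b = ["chr7,100,+,30M70S,0,0", "chr7,900,+,30S40M30S,0,0",
           "chr7,1800,+,70S30M,0,0"]

def sa_for(group, i):
    """SA value for segment i: every other segment of the same placement."""
    return "".join(entry + ";" for j, entry in enumerate(group) if j != i)

# One SA value per SAM record: 6 records in total.
records = [sa_for(group, i) for group in (locus_a, locus_b) for i in range(3)]
estimates = [naive_expected_from_sa(sa) for sa in records]
```

Every record estimates 3, so a slice holding just one placement would
look complete and its names would be wrongly discarded.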

Is there any way at all to tell that a secondary alignment exists from
another SAM entry?  If not I think the whole thing is impossible
without first sorting to name order, adding a tag (maybe TC as
mentioned above) and sorting back again.  Very ugly and slow!
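
As an in-memory sketch of that fallback (sort by name, tag each record
with its group size, restore the original order).  The dict layout and
function name are hypothetical, and writing the count into TC assumes
my reading of TC is acceptable; a real implementation would need
external sorts for large files.

```python
from itertools import groupby

def add_record_count_tag(records, tag="TC"):
    """Tag every record with the number of records sharing its name.

    records: list of dicts with "name" and "tags" keys (hypothetical layout).
    Returns the records in their original order; tags are set in place.
    """
    # Remember the original position so we can sort back afterwards.
    indexed = sorted(enumerate(records), key=lambda pair: pair[1]["name"])
    for _, group in groupby(indexed, key=lambda pair: pair[1]["name"]):
        group = list(group)
        for _, rec in group:
            rec["tags"][tag] = len(group)
    return [rec for _, rec in sorted(indexed, key=lambda pair: pair[0])]
```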

James

[1] Right now I don't know of any aligners that output this, although
it would be very useful for read-name collation algorithms.  It's
something I'm considering adding to samtools fixmate.

-- 
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 

