I discovered the samtools view -m option today, which is quite *baffling*. The help and man page both state:
-m INT only include reads with number of CIGAR operations consuming query sequence >= INT [0] So firstly there is some confusion on the wording. To me "number of CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is not what it means; both are value 10 (the total number of query sequence bases matched by a cigar operation). I struggled for a while to work out why it's phrased this way and not simply "length of mapped query". It is not legal in SAM to have a CIGAR string and query sequence with mismatched lengths except for unmapped data, and if we're explicitly stating "CIGAR operations consuming query sequence" then we're simply counting the sequence length via a very contorted fashion. The code even calls this option "min_qlen" internally so it was clearly the intention to check length. Or so I thought... The code actually does this (commit fbb4b135, Mar 24 2012 by Heng Li, message "filter on query sequence length (not aln len)"): + if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) == BAM_CHARD_CLIP) + qlen += bam_cigar_oplen(cigar[k]); So here it is explicitly cigar ops that consume query sequence *OR* are hard clips. Why? This sails completely against the help and man page, as hard clips are quite evidentally not consuming query sequence. Is this attempting to mean "length of original query sequence, pre clipping"? If so it should state this, but I'm wondering what the actual intention was as it's not clear in the minimal commit message. The message implies there were earlier commits and this is a fix ("not aln len"), but I can't find them so I assume this was squashed at some stage with the earlier rationale not making it in. So two questions. 1) For Heng: what task was this code originally added for? Is it an error that hard clips are considered as query bases? Either the code needs fixing or the help message, but I am unsure which is in error. 2) For all: do you use this option, if so what for? Were you aware of hard-clips being counted? Regards, James -- James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ------------------------------------------------------------------------------ Developer Access Program for Intel Xeon Phi Processors Access to Intel Xeon Phi processor-based developer platforms. With one year of Intel Parallel Studio XE. Training and support from Colfax. Order your platform today. http://sdm.link/xeonphi _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help