I once attempted to use this option to quickly remove most non-spliced alignments from an RNAseq dataset. The code snippet explains why that attempt failed spectacularly. I hadn't a clue that this included hard-clipping operations, though I see alignments with them so rarely that that wouldn't have been an issue.
Devon -- Devon Ryan, Ph.D. Email: dpr...@dpryan.com Data Manager/Bioinformatician Max Planck Institute of Immunobiology and Epigenetics Stübeweg 51 79108 Freiburg Germany On Tue, Jan 10, 2017 at 4:19 PM, James Bonfield <j...@sanger.ac.uk> wrote: > I discovered the samtools view -m option today, which is quite > *baffling*. > > The help and man page both state: > > -m INT only include reads with number of CIGAR operations consuming > query sequence >= INT [0] > > So firstly there is some confusion on the wording. To me "number of > CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is > not what it means; both are value 10 (the total number of query > sequence bases matched by a cigar operation). > > I struggled for a while to work out why it's phrased this way and not > simply "length of mapped query". It is not legal in SAM to have a > CIGAR string and query sequence with mismatched lengths except for > unmapped data, and if we're explicitly stating "CIGAR operations > consuming query sequence" then we're simply counting the sequence > length via a very contorted fashion. The code even calls this option > "min_qlen" internally so it was clearly the intention to check length. > > Or so I thought... The code actually does this (commit fbb4b135, Mar > 24 2012 by Heng Li, message "filter on query sequence length (not aln > len)"): > > + if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) > == BAM_CHARD_CLIP) > + qlen += bam_cigar_oplen(cigar[k]); > > So here it is explicitly cigar ops that consume query sequence *OR* > are hard clips. Why? This sails completely against the help and man > page, as hard clips are quite evidentally not consuming query > sequence. > > Is this attempting to mean "length of original query sequence, pre > clipping"? If so it should state this, but I'm wondering what the > actual intention was as it's not clear in the minimal commit message. > The message implies there were earlier commits and this is a fix ("not > aln len"), but I can't find them so I assume this was squashed at some > stage with the earlier rationale not making it in. > > So two questions. > > 1) For Heng: what task was this code originally added for? Is it an > error that hard clips are considered as query bases? Either the code > needs fixing or the help message, but I am unsure which is in error. > > 2) For all: do you use this option, if so what for? Were you aware of > hard-clips being counted? > > Regards, > > James > > -- > James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova > | Plurima gyrabant gymbolitare vabo; > A Staden Package developer: | Et Borogovorum mimzebant undique formae, > https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > ------------------------------------------------------------------------------ > Developer Access Program for Intel Xeon Phi Processors > Access to Intel Xeon Phi processor-based developer platforms. > With one year of Intel Parallel Studio XE. > Training and support from Colfax. > Order your platform today. http://sdm.link/xeonphi > _______________________________________________ > Samtools-devel mailing list > samtools-de...@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/samtools-devel ------------------------------------------------------------------------------ Developer Access Program for Intel Xeon Phi Processors Access to Intel Xeon Phi processor-based developer platforms. With one year of Intel Parallel Studio XE. Training and support from Colfax. Order your platform today. http://sdm.link/xeonphi _______________________________________________ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help