I once attempted to use this option to quickly remove most non-spliced
alignments from an RNAseq dataset. The code snippet explains why that
attempt failed spectacularly. I hadn't a clue that this included
hard-clipping operations, though I see alignments with them so rarely
that that wouldn't have been an issue.

Devon
--
Devon Ryan, Ph.D.
Email: dpr...@dpryan.com
Data Manager/Bioinformatician
Max Planck Institute of Immunobiology and Epigenetics
Stübeweg 51
79108 Freiburg
Germany


On Tue, Jan 10, 2017 at 4:19 PM, James Bonfield <j...@sanger.ac.uk> wrote:
> I discovered the samtools view -m option today, which is quite
> *baffling*.
>
> The help and man page both state:
>
>   -m INT   only include reads with number of CIGAR operations consuming
>            query sequence >= INT [0]
>
> So firstly there is some confusion on the wording.  To me "number of
> CIGAR operations" implies 10M is 1 and 3M2I4M1D1M is 5, but this is
> not what it means; both are value 10 (the total number of query
> sequence bases matched by a cigar operation).
>
> I struggled for a while to work out why it's phrased this way and not
> simply "length of mapped query".  It is not legal in SAM to have a
> CIGAR string and query sequence with mismatched lengths except for
> unmapped data, and if we're explicitly stating "CIGAR operations
> consuming query sequence" then we're simply counting the sequence
> length via a very contorted fashion.  The code even calls this option
> "min_qlen" internally so it was clearly the intention to check length.
>
> Or so I thought...  The code actually does this (commit fbb4b135, Mar
> 24 2012 by Heng Li, message "filter on query sequence length (not aln
> len)"):
>
> +   if ((bam_cigar_type(bam_cigar_op(cigar[k]))&1) || bam_cigar_op(cigar[k]) 
> == BAM_CHARD_CLIP)
> +           qlen += bam_cigar_oplen(cigar[k]);
>
> So here it is explicitly cigar ops that consume query sequence *OR*
> are hard clips.  Why?  This sails completely against the help and man
> page, as hard clips are quite evidentally not consuming query
> sequence.
>
> Is this attempting to mean "length of original query sequence, pre
> clipping"?  If so it should state this, but I'm wondering what the
> actual intention was as it's not clear in the minimal commit message.
> The message implies there were earlier commits and this is a fix ("not
> aln len"), but I can't find them so I assume this was squashed at some
> stage with the earlier rationale not making it in.
>
> So two questions.
>
> 1) For Heng: what task was this code originally added for?  Is it an
> error that hard clips are considered as query bases?  Either the code
> needs fixing or the help message, but I am unsure which is in error.
>
> 2) For all: do you use this option, if so what for? Were you aware of
> hard-clips being counted?
>
> Regards,
>
> James
>
> --
> James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
>                                   | Plurima gyrabant gymbolitare vabo;
>   A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
> https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>
> ------------------------------------------------------------------------------
> Developer Access Program for Intel Xeon Phi Processors
> Access to Intel Xeon Phi processor-based developer platforms.
> With one year of Intel Parallel Studio XE.
> Training and support from Colfax.
> Order your platform today. http://sdm.link/xeonphi
> _______________________________________________
> Samtools-devel mailing list
> samtools-de...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to