Re: [Bioc-sig-seq] trimLRPatters: inconstitency re max.[LR]mismatch

Harris A. Jaffee Wed, 17 Feb 2010 10:57:06 -0800

Absolutely!

On Feb 3, 2010, at 2:57 PM, Harris A. Jaffee wrote:

<patch_for_trimLRPatterns>
This is my solution. It vastly simplifies things, supports indelson both sides,and corrects the behavior of trimming based on non-flankingmatches. All matchingis now tied to the edge of the subject (i.e., starting.at=1 andgoing forward, orending.at = "the end" and extending back). Trimming is now by thelength of thelongest acceptably matching pattern rather than based on any match,of which therecan of course be many within the edit tolerance. Setting thetolerances with indelsallowed is now clear enough from the man page that even Iunderstand it, whereas
before I had no chance! ...
I did NOT change the default R/L-mismatch defaults of 0, but Iwanted to. Ibelieve 0 should come under the realm of an "error rate" (currentlyrequired to be>0 and <1). So, I suggest that 0 becomes an all zero vector of thelength of the
R/L-pattern, rather than a bunch of -1's plus one 0 at the end.

So, I would change the man page (and its supporting code) tosomething like


        A single numeric value in '[0, 1)' is taken as a "rate" ...

This would break backward compatibility, but I view it more as fixinga buggy

spec/implementation.

The indels story is a little different. I view that as a faultyimplementation,which I believe I fixed, at least partially, of a good spec (trimming*flanking*

patterns).  "partially" gets complicated, so I won't explain here...

-Harris

On Feb 17, 2010, at 1:04 PM, Patrick Aboyoun wrote:

Simon,
Harris has been spending the most time thinking and working on thisissue. I am including him in this e-mail.
Harris,
Do you think we should change the default behavior? I know youmentioned this is a previous e-mail and if users are not gettingthe results they are expecting, perhaps now is the time to bite thebullet (by breaking backwards compatibility) and change how 0 isinterpreted by the max.[LR]mismatch parameter.
Patrick


On 2/17/10 5:53 AM, Simon Anders wrote:
Hi Patrick
I've just tried to trim adaptors of my Solexa reads and got a bitpuzzled about trimLRPatterns's max.Rmismatch argument.
Let's say I have two sequences as follows:

> library(ShortRead)
[...]
> subject <- DNAString( "CGCCGGCCGAAATTTAA" )
> pattern <- DNAString( "AAATTTAAATTTAAATTT" )

These have a clear overlapping alignment without mismatch:

   CGCCGGCCGAAATTTAA <- subject
            AAATTTAAATTTAAATTT <- (right) pattern

   CGCCGGCCG <- trimmed subject
So, I would expect trimLRpattern to trim it to the given "trimmedsubject" sequence. However:
> trimLRPatterns( subject=subject, Rpattern=pattern )
  17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA

As you can see, the subject untrimmed.

If I now specify allow for 10% mismatch, trimming works:
> trimLRPatterns( subject=subject, Rpattern=pattern,max.Rmismatch=.1 )
  9-letter "DNAString" instance
seq: CGCCGGCCG

Why do I need to allow for mismatches?
This here is even a bit stranger: COmpare the result of allowingfor very small mismatches, say 10^-10, with exactly 0:
> trimLRPatterns( subject=subject, Rpattern=pattern,
+    max.Rmismatch=1e-10 )
  9-letter "DNAString" instance
seq: CGCCGGCCG

> trimLRPatterns( subject=subject, Rpattern=pattern,
+    max.Rmismatch=0 )
  17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA

This is a bit dis-continuous, isn't it? ;-)


Cheers
  Simon

> sessionInfo()
R version 2.11.0 Under development (unstable) (--)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.5.15 lattice_0.17-26 BSgenome_1.15.4Biostrings_2.15.21
[5] IRanges_1.5.46

loaded via a namespace (and not attached):
[1] Biobase_2.7.2 grid_2.11.0   hwriter_1.1   tools_2.11.0


+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: [email protected]


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] trimLRPatters: inconstitency re max.[LR]mismatch

Reply via email to