Absolutely!

On Feb 3, 2010, at 2:57 PM, Harris A. Jaffee wrote:
<patch_for_trimLRPatterns>
This is my solution. It vastly simplifies things, supports indels on both sides, and corrects the behavior of trimming based on non-flanking matches. All matching is now tied to the edge of the subject (i.e., starting.at=1 and going forward, or ending.at = "the end" and extending back). Trimming is now by the length of the longest acceptably matching pattern rather than based on any match, of which there can of course be many within the edit tolerance. Setting the tolerances with indels allowed is now clear enough from the man page that even I understand it, whereas
before I had no chance! ...
I did NOT change the default R/L-mismatch defaults of 0, but I wanted to. I believe 0 should come under the realm of an "error rate" (currently required to be >0 and <1). So, I suggest that 0 becomes an all zero vector of the length of the
R/L-pattern, rather than a bunch of -1's plus one 0 at the end.

So, I would change the man page (and its supporting code) to something like

        A single numeric value in '[0, 1)' is taken as a "rate" ...

This would break backward compatibility, but I view it more as fixing a buggy
spec/implementation.

The indels story is a little different. I view that as a faulty implementation, which I believe I fixed, at least partially, of a good spec (trimming *flanking*
patterns).  "partially" gets complicated, so I won't explain here...

-Harris

On Feb 17, 2010, at 1:04 PM, Patrick Aboyoun wrote:
Simon,
Harris has been spending the most time thinking and working on this issue. I am including him in this e-mail.

Harris,
Do you think we should change the default behavior? I know you mentioned this is a previous e-mail and if users are not getting the results they are expecting, perhaps now is the time to bite the bullet (by breaking backwards compatibility) and change how 0 is interpreted by the max.[LR]mismatch parameter.


Patrick


On 2/17/10 5:53 AM, Simon Anders wrote:
Hi Patrick

I've just tried to trim adaptors of my Solexa reads and got a bit puzzled about trimLRPatterns's max.Rmismatch argument.

Let's say I have two sequences as follows:

> library(ShortRead)
[...]
> subject <- DNAString( "CGCCGGCCGAAATTTAA" )
> pattern <- DNAString( "AAATTTAAATTTAAATTT" )

These have a clear overlapping alignment without mismatch:

   CGCCGGCCGAAATTTAA <- subject
            AAATTTAAATTTAAATTT <- (right) pattern

   CGCCGGCCG <- trimmed subject

So, I would expect trimLRpattern to trim it to the given "trimmed subject" sequence. However:

> trimLRPatterns( subject=subject, Rpattern=pattern )
  17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA

As you can see, the subject untrimmed.

If I now specify allow for 10% mismatch, trimming works:

> trimLRPatterns( subject=subject, Rpattern=pattern, max.Rmismatch=.1 )
  9-letter "DNAString" instance
seq: CGCCGGCCG

Why do I need to allow for mismatches?


This here is even a bit stranger: COmpare the result of allowing for very small mismatches, say 10^-10, with exactly 0:

> trimLRPatterns( subject=subject, Rpattern=pattern,
+    max.Rmismatch=1e-10 )
  9-letter "DNAString" instance
seq: CGCCGGCCG

> trimLRPatterns( subject=subject, Rpattern=pattern,
+    max.Rmismatch=0 )
  17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA

This is a bit dis-continuous, isn't it? ;-)


Cheers
  Simon

> sessionInfo()
R version 2.11.0 Under development (unstable) (--)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.5.15 lattice_0.17-26 BSgenome_1.15.4 Biostrings_2.15.21
[5] IRanges_1.5.46

loaded via a namespace (and not attached):
[1] Biobase_2.7.2 grid_2.11.0   hwriter_1.1   tools_2.11.0


+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: [email protected]


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to