Absolutely!
On Feb 3, 2010, at 2:57 PM, Harris A. Jaffee wrote:
<patch_for_trimLRPatterns>
This is my solution. It vastly simplifies things, supports indels
on both sides,
and corrects the behavior of trimming based on non-flanking
matches. All matching
is now tied to the edge of the subject (i.e., starting.at=1 and
going forward, or
ending.at = "the end" and extending back). Trimming is now by the
length of the
longest acceptably matching pattern rather than based on any match,
of which there
can of course be many within the edit tolerance. Setting the
tolerances with indels
allowed is now clear enough from the man page that even I
understand it, whereas
before I had no chance! ...
I did NOT change the default R/L-mismatch defaults of 0, but I
wanted to. I
believe 0 should come under the realm of an "error rate" (currently
required to be
>0 and <1). So, I suggest that 0 becomes an all zero vector of the
length of the
R/L-pattern, rather than a bunch of -1's plus one 0 at the end.
So, I would change the man page (and its supporting code) to
something like
A single numeric value in '[0, 1)' is taken as a "rate" ...
This would break backward compatibility, but I view it more as fixing
a buggy
spec/implementation.
The indels story is a little different. I view that as a faulty
implementation,
which I believe I fixed, at least partially, of a good spec (trimming
*flanking*
patterns). "partially" gets complicated, so I won't explain here...
-Harris
On Feb 17, 2010, at 1:04 PM, Patrick Aboyoun wrote:
Simon,
Harris has been spending the most time thinking and working on this
issue. I am including him in this e-mail.
Harris,
Do you think we should change the default behavior? I know you
mentioned this is a previous e-mail and if users are not getting
the results they are expecting, perhaps now is the time to bite the
bullet (by breaking backwards compatibility) and change how 0 is
interpreted by the max.[LR]mismatch parameter.
Patrick
On 2/17/10 5:53 AM, Simon Anders wrote:
Hi Patrick
I've just tried to trim adaptors of my Solexa reads and got a bit
puzzled about trimLRPatterns's max.Rmismatch argument.
Let's say I have two sequences as follows:
> library(ShortRead)
[...]
> subject <- DNAString( "CGCCGGCCGAAATTTAA" )
> pattern <- DNAString( "AAATTTAAATTTAAATTT" )
These have a clear overlapping alignment without mismatch:
CGCCGGCCGAAATTTAA <- subject
AAATTTAAATTTAAATTT <- (right) pattern
CGCCGGCCG <- trimmed subject
So, I would expect trimLRpattern to trim it to the given "trimmed
subject" sequence. However:
> trimLRPatterns( subject=subject, Rpattern=pattern )
17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA
As you can see, the subject untrimmed.
If I now specify allow for 10% mismatch, trimming works:
> trimLRPatterns( subject=subject, Rpattern=pattern,
max.Rmismatch=.1 )
9-letter "DNAString" instance
seq: CGCCGGCCG
Why do I need to allow for mismatches?
This here is even a bit stranger: COmpare the result of allowing
for very small mismatches, say 10^-10, with exactly 0:
> trimLRPatterns( subject=subject, Rpattern=pattern,
+ max.Rmismatch=1e-10 )
9-letter "DNAString" instance
seq: CGCCGGCCG
> trimLRPatterns( subject=subject, Rpattern=pattern,
+ max.Rmismatch=0 )
17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA
This is a bit dis-continuous, isn't it? ;-)
Cheers
Simon
> sessionInfo()
R version 2.11.0 Under development (unstable) (--)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.5.15 lattice_0.17-26 BSgenome_1.15.4
Biostrings_2.15.21
[5] IRanges_1.5.46
loaded via a namespace (and not attached):
[1] Biobase_2.7.2 grid_2.11.0 hwriter_1.1 tools_2.11.0
+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: [email protected]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing