Simon,
Harris has been spending the most time thinking and working on this issue. I am including him in this e-mail.

Harris,
Do you think we should change the default behavior? I know you mentioned this is a previous e-mail and if users are not getting the results they are expecting, perhaps now is the time to bite the bullet (by breaking backwards compatibility) and change how 0 is interpreted by the max.[LR]mismatch parameter.


Patrick


On 2/17/10 5:53 AM, Simon Anders wrote:
Hi Patrick

I've just tried to trim adaptors of my Solexa reads and got a bit puzzled about trimLRPatterns's max.Rmismatch argument.

Let's say I have two sequences as follows:

> library(ShortRead)
[...]
> subject <- DNAString( "CGCCGGCCGAAATTTAA" )
> pattern <- DNAString( "AAATTTAAATTTAAATTT" )

These have a clear overlapping alignment without mismatch:

   CGCCGGCCGAAATTTAA <- subject
            AAATTTAAATTTAAATTT <- (right) pattern

   CGCCGGCCG <- trimmed subject

So, I would expect trimLRpattern to trim it to the given "trimmed subject" sequence. However:

> trimLRPatterns( subject=subject, Rpattern=pattern )
  17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA

As you can see, the subject untrimmed.

If I now specify allow for 10% mismatch, trimming works:

> trimLRPatterns( subject=subject, Rpattern=pattern, max.Rmismatch=.1 )
  9-letter "DNAString" instance
seq: CGCCGGCCG

Why do I need to allow for mismatches?


This here is even a bit stranger: COmpare the result of allowing for very small mismatches, say 10^-10, with exactly 0:

> trimLRPatterns( subject=subject, Rpattern=pattern,
+    max.Rmismatch=1e-10 )
  9-letter "DNAString" instance
seq: CGCCGGCCG

> trimLRPatterns( subject=subject, Rpattern=pattern,
+    max.Rmismatch=0 )
  17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA

This is a bit dis-continuous, isn't it? ;-)


Cheers
  Simon

> sessionInfo()
R version 2.11.0 Under development (unstable) (--)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.5.15 lattice_0.17-26 BSgenome_1.15.4 Biostrings_2.15.21
[5] IRanges_1.5.46

loaded via a namespace (and not attached):
[1] Biobase_2.7.2 grid_2.11.0   hwriter_1.1   tools_2.11.0


+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: [email protected]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to