Re: [Bioc-sig-seq] complement and reverseComplement from a character vector

Hervé Pagès Wed, 17 Jun 2009 12:25:54 -0700

Hi Nicolas,

Added to Biostrings 2.13.21.


Note that in order to keep it simple I've modified it so the input must
be the sizes of the contigs, not the contigs themselves. That way it
remains independent on how the contigs are stored and there is no need
to make it a generic:

  N50 <- function(csizes)
  {
    sorted_csizes <- sort(csizes)
    tmp <- cumsum(sorted_csizes)
    N50 <- sorted_csizes[which(tmp >= max(tmp)/2)[1]]
    return(N50)
  }

Can you please send me the documentation (off-list) with the exact definition
of the N50 value (or a reference to it) and an example?

Thanks!
H.


Nicolas Delhomme wrote:

Dear Hervé,
Sorry for answering so late. Thanks for integrating these methods. Iagree that making them endomorphic is probably the best and mostintuitive solution.
I've been playing with Biostrings lately on "de-novo" assembly data andone important value that one wants to calculate is the so-called "N50"value. It is the size of the contig, which addition to the cumulativecontig size (ordered by decreasing sizes) makes it bigger than half thetotal size of all contigs. Basically it gives an idea of the contig sizedistribution within the data (although, one still needs to know thecontig size range). I could not find any function doing this, so I havewritten a function to calculate it and I think it could be aninteresting addition to the Biostrings package. Here is the code:
setGeneric("N50",function(x) standardGeneric("N50"))
setMethod("N50","XStringSet",function(x){
  # the order
  x.order<-order(width(x),decreasing=TRUE)
  width(x[x.order])[which(cumsum(width(x[x.order]))>=sum(width(x)/2))[[1]]]
  })

An example (using the non endomorphic functions...):

my.contig<-DNAStringSet(
    sapply(
        sample(c(100:10000),10),
        function(size){
            paste(sample(c("A","C","G","T"),size,replace=TRUE),collapse="")
    })
)
N50(my.contig)


If you find it valuable, I could contribute the documentation.

Best wishes,

---------------------------------------------------------------
Nicolas Delhomme

High Throughput Functional Genomics Center

European Molecular Biology Laboratory

Tel: +49 6221 387 8426
Email: [email protected]
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

P.S.

 > sessionInfo()
R version 2.9.0 (2009-04-17)
i386-apple-darwin8.11.1

locale:
en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.12.0 IRanges_1.2.0     Biobase_2.4.0

loaded via a namespace (and not attached):
[1] tools_2.9.0



On 2 Jun 2009, at 20:36, Hervé Pagès wrote:
Hi Nicolas,

Nicolas Delhomme wrote:
Hi all,
As described in the doc for reverse, complement andreverseComplement, x can be a character vector.When I tried, it turned out that these functions are not implementedfor complement and reverseComplement.
There are of some use for me, so I just wrote them up:
setMethod("complement", "character",
   function(x, ...){
           if(length(x)==1) x<-DNAString(x)
       else x<-DNAStringSet(x)
       complement(x)
   }
)
setMethod("reverseComplement", "character",
   function(x, ...){
           if(length(x)==1) x<-DNAString(x)
           else x<- DNAStringSet(x)
           reverseComplement(x)
   }
)
I just post them in case there would be of use for someone else. Irecognize that it does not save much compared to first converting thecharacter vector into a DNAString or DNAStringSet. At least, for me,it allows to skip some "input" evaluation test checking whether Istart with a character vector or a DNAString.
Thanks for the feedback!
The reason these method were not defined is that when the input ischaracter
it's not clear whether it should be treated as DNA or RNA input. However
choosing to treat it as DNA is probably what the user will want 99% ofthe
time so they could indeed be implemented as in your code above. So if the
input contains the "U" letter, they will simply fail (without trying to
be smart).
Note that there is no method for BString/BStringSet objects forexactly the
same reason.
A question subsists though: should these methods return aDNAString/DNAStringSetobject or should the result be turned back into an ordinary charactervector
before it's returned to the user? The latter would make these methods
"endomorphisms" (i.e. the output has the same type as the input) which is
more consistent with what the other methods do but I'm not against making
an exception when the input is character (not a big deal as long asthis isclearly documented). Then if the user really want this result to becharacter
then s/he can always apply as.character() to it.
If there are no objections, I will add these methods to the Biostringspackage.
Cheers,
H.
Best,
---------------------------------------------------------------
Nicolas Delhomme
High Throughput Functional Genomics Center
European Molecular Biology Laboratory
Tel: +49 6221 387 8426
Email: [email protected]
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] complement and reverseComplement from a character vector

Reply via email to