Hi Sean,

The advantage of using choice #2 is better performance (memory
and speed).

In addition to Michael's pointer, there is this post

  https://stat.ethz.ch/pipermail/r-devel/2013-June/066890.html

on R-devel that provides some details and pseudo-code. Unfortunately
a few things have changed since 2013:

  1) You need to replace:

       cachedXVectorList        with   XVectorList_holder
       cachedCharSeq            with   Chars_holder
       cache_XVectorList        with   hold_XVectorList
       get_cachedXRawList_elt   with   get_elt_from_XRawList_holder

  2) For readability you probably want to rename the variables
     accordingly. So replace:

       cached_ans               with   ans_holder
       cached_ans_elt           with   ans_elt_older

  3) The 'seq' member of a Chars_holder struct was renamed 'ptr'.
     So replace:

       cached_ans_elt->seq      with   ans_elt_older->ptr

Also since you want the result in a DNAStringSet object, you'll need
to replace any occurrence of BStringSet with DNAStringSet. One extra
difficulty with DNAStringSet objects is that the incoming string data
needs to be encoded before it's stored in the object. This encoding
can be performed by replacing

  memcpy((char *) cached_ans_elt->seq, foo,
         INTEGER(width)[i] * sizeof(char));

with

  Ocopy_bytes_to_i1i2_with_lkup(0, ans_elt_holder.length - 1,
        (char *) ans_elt_holder.ptr, ans_elt_holder.length,
        foo, INTEGER(width)[i] * sizeof(char),
        INTEGER(lkup), LENGTH(lkup));

where 'lkup' is an integer vector that you need to prepare at the
R level with:

  lkup <- get_seqtype_conversion_lookup("B", "DNA")

and that you then need to pass all the way down to the
read_text_file_in_DNAStringSet() function as an extra argument:

SEXP read_text_file_in_DNAStringSet(FILE *FN, SEXP lkup)
{
    ...
}

Let me know if you need further help with this.

H.

On 01/05/2016 05:31 AM, Michael Lawrence wrote:
You're going to want to use the Biostrings C API. You can see
TwoBitFile_read() in src/twoBit.c in rtracklayer for an example. It's
not trivial.

Michael


On Tue, Jan 5, 2016 at 5:21 AM, Sean Davis <seand...@gmail.com> wrote:
Forgive the really naive questions, but my C/C++ and Biostrings skills are
pretty rough.  I have been playing with Rcpp to interface to an external
library for reading sequence formats.  I am reading DNA sequences
one-at-a-time and would like to package a set of those up as a DNAStringSet.
I have at least two options:

1) Return a character vector back to R and then create a DNAStringSet in R.
This works now and works fine.
2) Create the DNAStringSet in the cpp code.

Are there advantages to using choice #2?  If so, where would I start to get
the necessary Rcpp/Biostrings glue in place? I know how to make new S4
objects in Rcpp, so it is really a question of knowing what class hierarchy
I need to instantiate and, roughly, how.  Any pointers or code examples?

Thanks,
Sean

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to