Thanks all for your replies.

> #define ASCII_MASK (1<<6)
> bool is_ascii_internal(Rcpp::String xi) {
>   return (LEVELS(xi.get_sexp()) & ASCII_MASK) != 0;
> }

This is almost exactly the solution I was attempting after reading the src/include/Defn.h file, but I was missing LEVELS().
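
To illustrate what the mask test in the quoted code does, independent of R's headers: as far as I understand, LEVELS() reads the general-purpose (gp) bits in an object's header, and ASCII_MASK selects bit 6 of that field. The sketch below is standalone C++ and does not touch R or Rcpp. has_ascii_flag and its gp_bits argument are hypothetical stand-ins for LEVELS(xi.get_sexp()), and the gp values in the self-check are made up.

```cpp
#include <cassert>

// Same value as the ASCII_MASK #define in the quoted code: bit 6 of the gp field.
constexpr unsigned int ASCII_MASK = 1u << 6;

// Hypothetical stand-in for (LEVELS(x) & ASCII_MASK) != 0.
// gp_bits plays the role of the value LEVELS() would return.
bool has_ascii_flag(unsigned int gp_bits) {
    return (gp_bits & ASCII_MASK) != 0;
}

// Small self-check with made-up gp values.
void check_has_ascii_flag() {
    assert(has_ascii_flag(ASCII_MASK));              // only bit 6 set
    assert(has_ascii_flag(ASCII_MASK | (1u << 5)));  // bit 6 plus another bit
    assert(!has_ascii_flag(0u));                     // no bits set
    assert(!has_ascii_flag(1u << 3));                // a different bit only
}
```

The test is a single AND and compare, so its cost does not depend on the length of the string, which is the efficiency point made in the thread.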
> Tomas wrote well about it:
> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html

That's a great read, thanks for sharing.

> "One day" this may be easier. Until then the best we can do may be to borrow helper functions from R

Agreed with both sentences. Accessing some helper functions from R would help a lot while at the same time maintaining consistency for the time being.

On Thu, Feb 18, 2021 at 2:37 PM Travers Ching <trave...@gmail.com> wrote:

> Hi Kevin and Andrade,
>
> I was also once looking for a way to test if strings are ASCII. Although looping over the characters would be fast enough in most cases, it isn't efficient or necessary.
>
> The IS_ASCII function isn't visible to users and it isn't obvious how one can re-implement the function unless you read R source code. I would agree with Andrade that a function in Rcpp would be helpful. Here's one re-implementation of the internal R function:
>
> #include <Rcpp.h>
>
> #define ASCII_MASK (1<<6)
> bool is_ascii_internal(Rcpp::String xi) {
>   return (LEVELS(xi.get_sexp()) & ASCII_MASK) != 0;
> }
>
> // [[Rcpp::export]]
> bool is_ascii(Rcpp::CharacterVector x) {
>   return is_ascii_internal(x[0]);
> }
>
> Travers
>
> On Thu, Feb 18, 2021 at 11:18 AM Kevin Ushey <kevinus...@gmail.com> wrote:
>
>> On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon <andradesolomon2...@gmail.com> wrote:
>> >
>> > Dear list,
>> >
>> > I see here that Rcpp strings have both get_encoding() and set_encoding() member functions, which respectively return and accept a cetype_t enum defined in Rinternals.h with options:
>> > CE_NATIVE = 0,
>> > CE_UTF8 = 1,
>> > CE_LATIN1 = 2,
>> > CE_BYTES = 3,
>> > CE_SYMBOL = 5,
>> > CE_ANY = 99
>> >
>> > This means that if the String is UTF-8, Latin1 or bytes, the String's get_encoding() member function will return 1, 2 or 3, respectively.
>> > Experimentally, I see that when I try it with string objects containing only 0 to 127 ASCII characters (with no manually set encoding), the get_encoding() member function returns 0, which means CE_NATIVE in the aforementioned enum. At first, I just assumed that, then, in Rcpp's eyes, ASCII would be considered CE_NATIVE. This could even make sense with what is described in R's Encoding() command help entry, i.e. that "character strings in R can be declared to be encoded in 'latin1' or 'UTF-8' or as 'bytes' (...) ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings". However, I later realized that was not the case: if one creates an object that stores a UTF-8 or Latin-1 string (e.g. x <- "á") then manually drops the encoding (i.e. Encoding(x) <- ""), if that object (i.e. x) was passed to Rcpp its get_encoding() would still return 0 (which suggests that CE_NATIVE corresponds to the "unknown" label returned by the Encoding() command).
>> >
>> > Note that, in R's official documentation, nothing is said about CE_NATIVE and, conversely, it is explicitly said that "Value CE_ANY is used to indicate a character string that will not need re-encoding – this is used for character strings known to be in ASCII, and can also be used as an input parameter where the intention is that the string is treated as a series of bytes". With this last bit of information in mind, I would then have expected that strings containing simple 0-127 ASCII characters and no manually set encoding, when passed to Rcpp code, would then have their get_encoding() member function return 99 instead of 0 - hence making it easy to check within Rcpp whether a string was ASCII only. That not being the case, my question actually becomes twofold:
>> >
>> > 1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for ASCII only text?
>>
>> I think this is because this is what R does; e.g.
>>
>> Encoding("ascii") => "unknown"
>>
>> As far as I can tell, CE_ANY is used only sparsely by R itself internally, and isn't really surfaced as a "public" encoding to be used.
>>
>> > 2) is there an established way to properly check within Rcpp whether a Rcpp String is ASCII only (besides obviously looping over each character to check if it's <128) just like it is done in R's C API with the IS_ASCII macro?
>>
>> I think this is the most reasonable way forward. If you need something more complicated or specific, I would honestly just recommend rolling your own class with the behaviors you need.
>>
>> If you think there's a way to make this happen with Rcpp's own String class, then a pull request would be welcomed.
>>
>> > Thanks,
>> >
>> > Andrade Solomon
>> >
>> > _______________________________________________
>> > Rcpp-devel mailing list
>> > Rcpp-devel@lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
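
For reference, the fallback discussed in the thread (looping over each character and checking that it is below 128) can be sketched in plain C++ as follows. This is a minimal sketch that assumes the string is available as raw bytes in an ASCII-compatible encoding such as UTF-8 or Latin-1. The name is_ascii_bytes is hypothetical and is not part of R or Rcpp.

```cpp
#include <cassert>
#include <string>

// Returns true when every byte of s is in the 7-bit ASCII range (0..127).
// Assumes s holds raw bytes in an ASCII-compatible encoding.
bool is_ascii_bytes(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) return false;  // high bit set: not an ASCII byte
    }
    return true;
}

// Small self-check; call it from main() or a test harness.
void check_is_ascii_bytes() {
    assert(is_ascii_bytes("plain ascii"));
    assert(is_ascii_bytes(""));            // vacuously true for empty input
    assert(!is_ascii_bytes("\xC3\xA1"));   // "á" encoded as UTF-8
    assert(!is_ascii_bytes("\xE1"));       // "á" encoded as Latin-1
}
```

Unlike the flag test, this scan is linear in the length of the string, but it relies only on the bytes themselves and not on R's internal headers.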