Hi Kevin and Andrade,

I was also once looking for a way to test whether strings are ASCII. Looping over the characters is fast enough in most cases, but it is neither the most efficient approach nor a necessary one.
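For comparison, the character loop mentioned above could look something like the sketch below. It is plain C++ with no Rcpp dependency, and the names is_ascii_bytes and check_is_ascii_bytes are made up for illustration; they are not part of R or Rcpp. It assumes the string is handed over as raw bytes, so any byte above 127 counts as non-ASCII.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of R or Rcpp): returns true when every byte
// of `s` lies in the 7-bit ASCII range 0..127.
bool is_ascii_bytes(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 127) return false;  // high bit set, so this byte is not ASCII
    }
    return true;
}

// Small self-check illustrating the intended behaviour.
void check_is_ascii_bytes() {
    assert(is_ascii_bytes("plain text"));
    assert(is_ascii_bytes(""));            // no bytes, so nothing is non-ASCII
    assert(!is_ascii_bytes("\xC3\xA1"));   // the two UTF-8 bytes of "á"
}
```

This scans the whole string on every call, which is the cost that reading a flag already stored by R avoids.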
The IS_ASCII function isn't visible to users and it isn't obvious how one can re-implement the function unless you read R source code. I would agree with Andrade that a function in Rcpp would be helpful. Here's one re-implementation of the internal R function:

#include <Rcpp.h>

#define ASCII_MASK (1<<6)

bool is_ascii_internal(Rcpp::String xi) {
  return (LEVELS(xi.get_sexp()) & ASCII_MASK) != 0;
}

// [[Rcpp::export]]
bool is_ascii(Rcpp::CharacterVector x) {
  return is_ascii_internal(x[0]);
}

Travers

On Thu, Feb 18, 2021 at 11:18 AM Kevin Ushey <kevinus...@gmail.com> wrote:

> On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon
> <andradesolomon2...@gmail.com> wrote:
> >
> > Dear list,
> >
> > I see here that Rccp strings have both a get_encoding() and a
> > set_encoding() member functions, which respectively return and accept a
> > cetype_t enum defined in Rinternals.h with options:
> >
> > CE_NATIVE = 0,
> > CE_UTF8 = 1,
> > CE_LATIN1 = 2,
> > CE_BYTES = 3,
> > CE_SYMBOL = 5,
> > CE_ANY = 99
> >
> > This means that if the String is UTF-8, Latin1 or Bytecode, the String's
> > get_encoding() member function will return 1, 2 or 3, respectively.
> > Experimentally, I see that when I try it with string objects containing
> > only 0 to 127 ASCII characters (with no manually set encoding), the
> > get_encoding() member function returns 0, which means CE_NATIVE in the
> > aforementioned enum. At first, I just assumed that, then, in Rcpp's eyes,
> > ASCII would be considered CE_NATIVE. This could even make sense with what
> > is described in R's Encoding() command help entry, i.e. that "character
> > strings in R can be declared to be encoded in 'latin1' or 'UTF-8' or as
> > 'bytes' (...) ASCII strings will never be marked with a declared encoding,
> > since their representation is the same in all supported encodings".
> > However, I later realized that was not the case: if one creates an object
> > that stores a UTF-8 or Latin-1 string (e.g. x <- "á") then manually drops
> > the encoding (i.e. Encoding(x) <- ""), if that object (i.e. x) was passed
> > to Rcpp its get_encoding() would still return 0 (which suggests that
> > CE_NATIVE corresponds to the "unknown" label returned by the Encoding()
> > command).
> >
> > Note that, in R's official documentation, nothing is said about
> > CE_NATIVE and, conversely, it is explicitly said that "Value CE_ANY is used
> > to indicate a character string that will not need re-encoding – this is
> > used for character strings known to be in ASCII, and can also be used as an
> > input parameter where the intention is that the string is treated as a
> > series of bytes". With this last bit of information in mind, I would then
> > have expected that strings containing simple 0-127 ASCII characters and no
> > manually set encoding, when passed to a Rcpp code would then have their
> > get_encoding() member function return 99 instead of 0 - hence making it
> > easy to check within Rcpp whether a string was ASCII only. That not being
> > the case, my question actually becomes two-folded:
> >
> > 1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for
> > ASCII only text?
>
> I think this is because this is what R does; e.g.
>
> Encoding("ascii") =>. "unknown"
>
> As far as I can tell, CE_ANY is used only sparsely by R itself
> internally, and isn't really surfaced as a "public" encoding to be
> used.
>
> > 2) is there an established way to properly check within Rcpp whether a
> > Rcpp String is ASCII only (besides obviously looping over each character to
> > check if it's <128) just like it is done in R's C API with the IS_ASCII
> > macro?
>
> I think this is the most reasonable way forward. If you need something
> more complicated or specific, I would honestly just recommend rolling
> your own class with the behaviors you need.
>
> If you think there's a way to make this happen with Rcpp's own String
> class, then a pull request would be welcomed.
>
> > Thanks,
> >
> > Andrade Solomon
> >
> > _______________________________________________
> > Rcpp-devel mailing list
> > Rcpp-devel@lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel