On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon <andradesolomon2...@gmail.com> wrote:
>
> Dear list,
>
> I see here that Rcpp strings have both get_encoding() and set_encoding()
> member functions, which respectively return and accept a cetype_t enum
> defined in Rinternals.h with options:
> CE_NATIVE = 0,
> CE_UTF8 = 1,
> CE_LATIN1 = 2,
> CE_BYTES = 3,
> CE_SYMBOL = 5,
> CE_ANY = 99
>
> This means that if the String is marked as UTF-8, Latin-1 or bytes, the
> String's get_encoding() member function will return 1, 2 or 3, respectively.
> Experimentally, I see that when I try it with string objects containing only
> 0 to 127 ASCII characters (with no manually set encoding), the get_encoding()
> member function returns 0, which means CE_NATIVE in the aforementioned enum.
> At first, I just assumed that, in Rcpp's eyes, ASCII would be
> considered CE_NATIVE. This could even make sense with what is described in
> R's Encoding() command help entry, i.e. that "character strings in R can be
> declared to be encoded in 'latin1' or 'UTF-8' or as 'bytes' (...) ASCII
> strings will never be marked with a declared encoding, since their
> representation is the same in all supported encodings". However, I later
> realized that was not the case: if one creates an object that stores a UTF-8
> or Latin-1 string (e.g. x <- "á"), then manually drops the encoding (i.e.
> Encoding(x) <- ""), and that object (i.e. x) is passed to Rcpp, its
> get_encoding() still returns 0 (which suggests that CE_NATIVE
> corresponds to the "unknown" label returned by the Encoding() command).
>
> Note that, in R's official documentation, nothing is said about CE_NATIVE
> and, conversely, it is explicitly said that "Value CE_ANY is used to indicate
> a character string that will not need re-encoding – this is used for
> character strings known to be in ASCII, and can also be used as an input
> parameter where the intention is that the string is treated as a series of
> bytes".
> With this last bit of information in mind, I would then have expected
> that strings containing only 0-127 ASCII characters and no manually set
> encoding, when passed to Rcpp code, would have their get_encoding()
> member function return 99 instead of 0, hence making it easy to check within
> Rcpp whether a string is ASCII only. That not being the case, my question
> actually becomes twofold:
>
> 1) Why does Rcpp's get_encoding() apparently return 0 instead of 99 for
> ASCII-only text?

I think this is because this is what R does; e.g. Encoding("ascii") returns
"unknown". As far as I can tell, CE_ANY is used only sparsely by R itself
internally, and isn't really surfaced as a "public" encoding to be used.

> 2) Is there an established way to properly check within Rcpp whether a Rcpp
> String is ASCII only (besides obviously looping over each character to check
> if it's <128), just like it is done in R's C API with the IS_ASCII macro?

I think that (looping over each character) is the most reasonable way
forward. If you need something more complicated or specific, I would honestly
just recommend rolling your own class with the behaviors you need. If you
think there's a way to make this happen with Rcpp's own String class, then a
pull request would be welcomed.

> Thanks,
>
> Andrade Solomon
>
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel@lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
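
For what it's worth, the byte loop mentioned in question 2 could look roughly like the sketch below. It is plain C++ with no Rcpp dependency, and the helper name is_ascii_only is made up for illustration (it is not part of Rcpp or of R's C API). It only inspects the raw bytes, so it says nothing about which encoding mark the string carries.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an Rcpp or R API function): returns true when
// every byte of `s` lies in the 7-bit ASCII range 0..127.
inline bool is_ascii_only(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast to unsigned char first: plain char may be signed, in which
        // case bytes >= 0x80 would otherwise compare as negative values.
        if (static_cast<unsigned char>(s[i]) > 0x7F) {
            return false;
        }
    }
    return true;
}
```

Inside an Rcpp function this could presumably be called as is_ascii_only(std::string(s.get_cstring())) for an Rcpp::String s, assuming get_cstring() hands back the underlying bytes unchanged; I have not checked how that interacts with NA strings, so treat that part as an assumption.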