Dear list, I see here <https://dirk.eddelbuettel.com/code/rcpp/html/classRcpp_1_1String.html> that Rcpp Strings have both get_encoding() and set_encoding() member functions, which respectively return and accept a cetype_t enum defined in Rinternals.h <https://github.com/wch/r-source/blob/bf0a0a9d12f2ce5d66673dc32cd253524f3270bf/src/include/Rinternals.h#L928-L935> with the values: CE_NATIVE = 0, CE_UTF8 = 1, CE_LATIN1 = 2, CE_BYTES = 3, CE_SYMBOL = 5, CE_ANY = 99
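For reference, the values listed above can be mirrored in a small standalone helper that turns the integer coming back from get_encoding() into a readable label. This is only a sketch: encoding_label() is a name made up for illustration, not something provided by Rcpp or R, and real code should use the cetype_t enum from Rinternals.h rather than raw integers.

```cpp
#include <string>

// Hypothetical helper (not part of Rcpp or R): map the integer value of a
// cetype_t, as quoted from Rinternals.h above, to its enumerator name.
inline std::string encoding_label(int ce) {
    switch (ce) {
    case 0:  return "CE_NATIVE";
    case 1:  return "CE_UTF8";
    case 2:  return "CE_LATIN1";
    case 3:  return "CE_BYTES";
    case 5:  return "CE_SYMBOL";
    case 99: return "CE_ANY";
    default: return "unrecognized"; // any value not in the list above
    }
}
```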
This means that if the String is marked as UTF-8, Latin-1 or bytes, its get_encoding() member function will return 1, 2 or 3, respectively. Experimentally, I see that for strings containing only ASCII characters (codes 0 to 127) and no manually set encoding, get_encoding() returns 0, which is CE_NATIVE in the aforementioned enum. At first, I assumed this meant that, in Rcpp's eyes, ASCII is considered CE_NATIVE. This would even be consistent with what is described in the help entry for R's Encoding() function <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Encoding>, i.e. that "character strings in R can be declared to be encoded in 'latin1' or 'UTF-8' or as 'bytes' (...) ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings".

However, I later realized that was not the case: if one creates an object that stores a UTF-8 or Latin-1 string (e.g. x <- "á"), manually drops the encoding mark (i.e. Encoding(x) <- "") and then passes that object (i.e. x) to Rcpp, its get_encoding() still returns 0, which suggests that CE_NATIVE corresponds to the "unknown" label returned by Encoding(). Note that R's official documentation <https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Character-encoding-issues> says nothing about CE_NATIVE and, conversely, explicitly states that "Value CE_ANY is used to indicate a character string that will not need re-encoding – this is used for character strings known to be in ASCII, and can also be used as an input parameter where the intention is that the string is treated as a series of bytes".
With this last bit of information in mind, I would have expected strings containing only ASCII characters (0-127) and no manually set encoding to have get_encoding() return 99 instead of 0 when passed to Rcpp code, which would make it easy to check within Rcpp whether a string is ASCII only. That not being the case, my question becomes twofold:

1) Why does Rcpp's get_encoding() apparently return 0 instead of 99 for ASCII-only text?

2) Is there an established way to check within Rcpp whether an Rcpp String is ASCII only (besides the obvious loop over each character to check that it is <128), just as is done in R's C API with the IS_ASCII macro?

Thanks,
Andrade Solomon
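P.S. For completeness, the loop-based fallback mentioned in question 2 could look roughly like the sketch below. It is plain C++ with no Rcpp dependency, and is_ascii_only() is a hypothetical name, not an existing Rcpp or R function. It only inspects the bytes and ignores whatever encoding mark the string carries.

```cpp
#include <string>

// Hypothetical helper (not part of Rcpp or R): returns true when every byte
// of the input is in the 7-bit ASCII range 0..127. An empty string counts
// as ASCII only.
inline bool is_ascii_only(const std::string& s) {
    for (unsigned char c : s) {   // unsigned, so bytes >= 128 are not negative
        if (c > 127) return false;
    }
    return true;
}
```

Inside an Rcpp function this could presumably be called on the C string held by the Rcpp String, e.g. is_ascii_only(s.get_cstring()), assuming s is an Rcpp::String that is not NA.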
_______________________________________________ Rcpp-devel mailing list Rcpp-devel@lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel