Re: [Rcpp-devel] How to properly check if a String has only ASCII characters in Rcpp?

Kevin Ushey Thu, 18 Feb 2021 11:18:52 -0800

On Thu, Feb 18, 2021 at 1:36 AM Andrade Solomon
<andradesolomon2...@gmail.com> wrote:
>
> Dear list,
>
> I see here that Rcpp strings have both get_encoding() and set_encoding() 
> member functions, which respectively return and accept a cetype_t enum 
> defined in Rinternals.h with options:
>     CE_NATIVE = 0,
>     CE_UTF8   = 1,
>     CE_LATIN1 = 2,
>     CE_BYTES  = 3,
>     CE_SYMBOL = 5,
>     CE_ANY = 99
>
> This means that if the String is UTF-8, Latin-1 or bytes, the String's 
> get_encoding() member function will return 1, 2 or 3, respectively. 
> Experimentally, I see that when I try it with string objects containing only 
> 0 to 127 ASCII characters (with no manually set encoding), the get_encoding() 
> member function returns 0, which means CE_NATIVE in the aforementioned enum. 
> At first, I just assumed that, then, in Rcpp's eyes, ASCII would be 
> considered CE_NATIVE. This could even make sense with what is described in 
> R's Encoding() command help entry, i.e. that "character strings in R can be 
> declared to be encoded in 'latin1' or 'UTF-8' or as 'bytes' (...) ASCII 
> strings will never be marked with a declared encoding, since their 
> representation is the same in all supported encodings". However, I later 
> realized that was not the case: if one creates an object that stores a UTF-8 
> or Latin-1 string (e.g. x <- "á") then manually drops the encoding (i.e. 
> Encoding(x) <- ""), if that object (i.e. x) was passed to Rcpp its 
> get_encoding() would still return 0 (which suggests that CE_NATIVE 
> corresponds to the "unknown" label returned by the Encoding() command).
>
> Note that, in R's official documentation, nothing is said about CE_NATIVE 
> and, conversely, it is explicitly said that "Value CE_ANY is used to indicate 
> a character string that will not need re-encoding – this is used for 
> character strings known to be in ASCII, and can also be used as an input 
> parameter where the intention is that the string is treated as a series of 
> bytes". With this last bit of information in mind, I would then have expected 
> that strings containing simple 0-127 ASCII characters and no manually set 
> encoding, when passed to Rcpp code, would then have their get_encoding() 
> member function return 99 instead of 0 - hence making it easy to check within 
> Rcpp whether a string was ASCII only. That not being the case, my question 
> actually becomes twofold:
>
> 1) why does Rcpp's get_encoding() apparently return 0 instead of 99 for ASCII 
> only text?


I think this is because this is what R does; e.g.

    Encoding("ascii")  =>  "unknown"

As far as I can tell, CE_ANY is used only sparsely by R itself
internally, and isn't really surfaced as a "public" encoding to be
used.
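
To make the correspondence concrete, here is a rough sketch of how the
cetype_t values quoted above line up with the labels that Encoding()
reports in R. The enum name ce_mirror and the function encoding_label()
are hypothetical names made up for this illustration; the enum only
mirrors the values listed in the original message so that the snippet
compiles without the R headers.

```cpp
#include <string>

// Hypothetical stand-in for cetype_t, copied from the values quoted in
// the original message so this sketch does not need Rinternals.h.
enum ce_mirror {
    MIRROR_CE_NATIVE = 0,
    MIRROR_CE_UTF8   = 1,
    MIRROR_CE_LATIN1 = 2,
    MIRROR_CE_BYTES  = 3,
    MIRROR_CE_SYMBOL = 5,
    MIRROR_CE_ANY    = 99
};

// Hypothetical helper: the label R's Encoding() shows for a string
// whose encoding mark corresponds to the given value.
inline std::string encoding_label(int ce) {
    switch (ce) {
        case MIRROR_CE_NATIVE: return "unknown"; // unmarked, includes plain ASCII
        case MIRROR_CE_UTF8:   return "UTF-8";
        case MIRROR_CE_LATIN1: return "latin1";
        case MIRROR_CE_BYTES:  return "bytes";
        default:               return "(not reported by Encoding())";
    }
}
```

Since an ASCII-only string is never marked, it lands in the first case,
which is consistent with get_encoding() returning 0 rather than 99.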

> 2) is there an established way to properly check within Rcpp whether a Rcpp 
> String is ASCII only (besides obviously looping over each character to check 
> if it's <128) just like it is done in R's C API with the IS_ASCII macro?

I think this is the most reasonable way forward. If you need something
more complicated or specific, I would honestly just recommend rolling
your own class with the behaviors you need.
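
For what it's worth, the loop is only a few lines. Below is a minimal
sketch in plain C++; is_ascii_only() is a name made up for this example
and is not part of Rcpp. I believe it could be called on an Rcpp::String
x as is_ascii_only(x.get_cstring()), but treat that wiring as an
assumption to verify, and handle NA_STRING separately before doing so.

```cpp
#include <string>

// Hypothetical helper (not an Rcpp function): returns true when every
// byte of `s` lies in the 7-bit ASCII range 0-127. An empty string
// counts as ASCII-only.
inline bool is_ascii_only(const std::string& s) {
    for (unsigned char c : s) {
        // Any byte with the high bit set is outside ASCII, whatever the
        // declared encoding (UTF-8, Latin-1, bytes, or unmarked) is.
        if (c > 127) {
            return false;
        }
    }
    return true;
}
```

Iterating as unsigned char matters here: plain char may be signed, in
which case bytes above 127 would compare as negative values.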

If you think there's a way to make this happen with Rcpp's own String
class, then a pull request would be welcomed.

> Thanks,
>
> Andrade Solomon
>
>
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel@lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
