Just to follow up, in case anyone other than me is using Unicode in R: Rcpp does not support Unicode, or really any encoding other than 7-bit ASCII. Internally, R marks every string with an encoding, typically UTF-8, Latin-1 or ASCII. Rcpp::as<std::string> just copies the bytes over and ignores that encoding mark. This means that if you take a string that was UTF-8 and later wrap it again, the encoding information is lost and any non-ASCII characters get corrupted. In particular, never use Rcpp::as<std::wstring>: each byte is widened on its own, and the UTF-8 is never decoded into Unicode code points.
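A minimal sketch, independent of R and Rcpp, of the difference between widening bytes and decoding UTF-8. The helper names widen_bytes, decode_utf8 and check_example are hypothetical, and the use of signed char with 16-bit units is an assumption meant to mimic a platform where char is signed and wchar_t is 16 bits wide; it is not taken from the Rcpp source. Under that assumption, widening the UTF-8 bytes of "a\u02a5b" reproduces the 61 ffca ffa5 62 values reported in the original message, while decoding gives the three expected code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: widen every byte on its own, with no decoding.
// Assumption: signed char and 16-bit units stand in for a platform where
// char is signed and wchar_t is 16 bits wide.
std::vector<std::uint16_t> widen_bytes(const std::string& bytes) {
    std::vector<std::uint16_t> out;
    for (char c : bytes) {
        const signed char sc = static_cast<signed char>(c);
        // Bytes >= 0x80 are negative as signed char, so they sign-extend.
        out.push_back(static_cast<std::uint16_t>(sc));
    }
    return out;
}

// Hypothetical helper: decode UTF-8 into Unicode code points.
// It checks lead bytes, continuation bytes and truncation only;
// overlong forms and surrogates are not rejected in this sketch.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1Fu;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0Fu;
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07u;
            extra = 3;
        } else {
            throw std::runtime_error("invalid UTF-8 lead byte");
        }
        if (bytes.size() - i <= extra) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | static_cast<std::uint32_t>(cont & 0x3Fu);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Usage: "a", U+02A5, "b" encoded as UTF-8 (U+02A5 is the bytes CA A5).
void check_example() {
    const std::string utf8 = std::string("a\xCA\xA5") + "b";
    const std::vector<std::uint16_t> widened = widen_bytes(utf8);
    const std::vector<std::uint32_t> decoded = decode_utf8(utf8);
    // Four units, two of them sign-extended garbage.
    assert((widened == std::vector<std::uint16_t>{0x61, 0xFFCA, 0xFFA5, 0x62}));
    // Three code points, as intended.
    assert((decoded == std::vector<std::uint32_t>{0x61, 0x2A5, 0x62}));
}
```

The point of the comparison is that a wide string is only useful if its elements are code points (or UTF-16 units), so some decoding step has to happen before or instead of the byte-by-byte copy.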
If you want (or need) to support Unicode text in an R plugin, you need to use Rf_translateCharUTF8(...) to get the string. Regardless of the string's original encoding, R will make sure what you get back is encoded as UTF-8. To set a string into an R object, you have to use the corresponding Rf_mkCharLenCE(p, len, CE_UTF8) function, which tells R that the data you are handing it is UTF-8.

Ned.

From: rcpp-devel-boun...@lists.r-forge.r-project.org [mailto:rcpp-devel-boun...@lists.r-forge.r-project.org] On Behalf Of Ned Harding
Sent: Wednesday, June 26, 2013 11:54 AM
To: rcpp-devel@lists.r-forge.r-project.org
Subject: [Rcpp-devel] Unicode on windows

I am having issues with the wide string conversion to and from Rcpp. When taking in a string from R that is encoded as UTF-8, I would expect as<wstring> to convert the UTF-8 to a wide string. Instead, it just widens each character and leaves the UTF-8 encoding in place. I have no issue with UTF-8; my issue is that Rcpp doesn't seem to be able to tell me what encoding the source is in, so I don't know whether I should convert or not.

Similarly, I would expect wrap<wstring> to produce a UTF-8 encoded SEXP, but instead the encoding in R comes back as "unknown" and the data can't be printed.

See the C++ and R sources below, along with the output.

C++ function
----------------------------------------
RcppExport SEXP TestWide(SEXP _strIn)
{
    std::wstring strIn = Rcpp::as<std::wstring>(_strIn);
    for (const wchar_t *p = strIn.c_str(); *p; ++p)
        Rprintf("%x\n", *p);
    std::wstring str = L"a\x02a5c";
    return Rcpp::wrap(str);
}

R Script
----------------------------------------
test <- "a\u02a5b"
a <- .Call("TestWide", test, PACKAGE = "AlteryxRDataX")
print(Encoding(a))
print(a)

R Output
----------------------------------------
R version 3.0.0 (2013-04-03) - x86_64
rgeos version: 0.2-16, (SVN revision 389)
GEOS runtime version: 3.3.6-CAPI-1.7.6
Polygon checking: TRUE

61
ffca
ffa5
62
"unknown"
"a?"

Thanks,

Ned Harding
Alteryx CTO
3825 Iris Avenue, Suite 150
Boulder, CO 80301
Phone: 720-259-0541
eMail: n...@alteryx.com