Just to follow up in case anyone other than me is using Unicode in R: Rcpp
does not support Unicode, or really any encoding other than 7-bit ASCII.
Internally, R marks every string with an encoding, typically UTF-8, Latin-1 or
ASCII. When you use as<std::string>, Rcpp just copies the bytes over and
ignores the encoding. This means that if you take a string that was UTF-8 and
later wrap it again, the encoding information is lost and the characters get
corrupted. In particular, never use Rcpp::as<std::wstring>, because each byte
gets widened individually without being decoded into Unicode code points.
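
To make the widening problem concrete, here is a minimal standalone sketch (no
R or Rcpp involved) that reproduces the 61 / ffca / ffa5 / 62 values from the
original report. The helper name widen_bytes is hypothetical, and the signed
char and 16-bit unit are assumptions chosen to mimic a Windows x64 build; this
is not Rcpp's actual code, just an illustration of byte-by-byte widening.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: widen every byte on its own, the way a naive
// char -> wchar_t copy does. signed char and a 16-bit unit are used
// explicitly to mimic Windows x64 (an assumption about the platform).
std::vector<std::uint16_t> widen_bytes(const std::string& bytes) {
    std::vector<std::uint16_t> out;
    for (char c : bytes) {
        // Bytes >= 0x80 are negative as signed char, so they sign-extend:
        // 0xCA becomes 0xFFCA instead of being combined with the next byte.
        out.push_back(static_cast<std::uint16_t>(static_cast<signed char>(c)));
    }
    return out;
}
```

Calling widen_bytes("a\xCA\xA5" "b"), where CA A5 is the UTF-8 encoding of
U+02A5, gives four units rather than three, and the two middle ones are not
valid characters at all.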

If you want (or need) to support Unicode text in an R plugin, use
Rf_translateCharUTF8(...) to get the string. Regardless of its original
encoding, R will hand it back encoded as UTF-8. To store a string in an R
object, use the corresponding Rf_mkCharLenCE(p, len, CE_UTF8) function, which
tells R that the data you are passing is UTF-8.
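
If wide or per-character access is needed on the C++ side, the UTF-8 bytes
still have to be decoded explicitly. The following is a rough standalone
sketch of that step; decode_utf8 and encode_utf8 are hypothetical helper names,
not part of R or Rcpp. The assumption is that the input bytes would come from
Rf_translateCharUTF8 and the re-encoded bytes would be handed to
Rf_mkCharLenCE with CE_UTF8. It checks continuation bytes but does not reject
overlong forms or surrogates, so treat it as an illustration only.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (s.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: encode code points back into UTF-8 bytes.
std::string encode_utf8(const std::u32string& cps) {
    std::string out;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

With this, the bytes 61 CA A5 62 decode to the three code points U+0061,
U+02A5, U+0062, and encoding them again returns the same four bytes, which is
the round trip that plain as/wrap does not give you.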

Ned.

From: rcpp-devel-boun...@lists.r-forge.r-project.org 
[mailto:rcpp-devel-boun...@lists.r-forge.r-project.org] On Behalf Of Ned Harding
Sent: Wednesday, June 26, 2013 11:54 AM
To: rcpp-devel@lists.r-forge.r-project.org
Subject: [Rcpp-devel] Unicode on windows

I am having issues with the wide string conversion to and from Rcpp. When
taking in a string from R that is encoded as UTF-8, I would expect
as<wstring> to convert the UTF-8 to a wide string. Instead, it just widens
each byte and leaves the UTF-8 encoding in place. I have no issue with UTF-8
itself; my issue is that Rcpp doesn't seem to be able to tell me what encoding
the source is in, so I don't know whether I should convert or not.

Similarly, I would expect wrap<wstring> to produce a UTF-8 encoded SEXP, but
instead the encoding in R comes back as "unknown" and the data can't be
printed. See the C++ and R sources below, along with the output.

C++ function
----------------------------------------
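#include <Rcpp.h>

RcppExport SEXP TestWide(SEXP _strIn)
{
    // Print each wchar_t unit that as<std::wstring> produced, in hex.
    std::wstring strIn = Rcpp::as<std::wstring>(_strIn);
    for (const wchar_t *p = strIn.c_str(); *p; ++p)
        Rprintf("%x\n", *p);
    // Note: \x consumes every hex digit that follows, so this literal is
    // 'a' followed by the single character U+2A5C, not U+02A5 then 'c'.
    std::wstring str = L"a\x02a5c";
    return Rcpp::wrap(str);
}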

R Script
----------------------------------------
test <- "a\u02a5b"
a<-.Call( "TestWide", test, PACKAGE = "AlteryxRDataX" )
print(Encoding(a))
print(a)

R Output
----------------------------------------
R version 3.0.0 (2013-04-03) - x86_64
rgeos version: 0.2-16, (SVN revision 389)
GEOS runtime version: 3.3.6-CAPI-1.7.6
Polygon checking: TRUE
61
ffca
ffa5
62
"unknown"
"a?"

Thanks,

Ned Harding
Alteryx
CTO
