Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Tomáš Bořil Thu, 11 Apr 2019 00:12:09 -0700

Or, if this cannot be done easily, please disable the "utf-8" value
of the encoding argument of source() in R on Windows:
source(..., encoding = "utf-8")
-> error: "utf-8" does not work correctly on Windows.
-> or, at least, a warning: "utf-8" is handled by "best fit" on Windows
and some characters in string literals may be changed automatically.
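
Until something like this exists in source() itself, a similar check can
be approximated at user level. This is only a rough sketch under stated
assumptions: non_ascii_lines() and source_utf8_checked() are hypothetical
helper names, not part of R, and the check simply treats every non-ASCII
character as potentially at risk whenever the session is not running in
a UTF-8 locale.

```r
# Hypothetical helper (not part of R): return the numbers of the lines in
# `file` that contain non-ASCII characters. The file is assumed to be UTF-8.
non_ascii_lines <- function(file) {
  lines <- readLines(file, encoding = "UTF-8", warn = FALSE)
  # Strict conversion to ASCII: a line that cannot be converted comes back
  # as NA (or altered, depending on the platform's iconv); flag both cases.
  ascii <- iconv(lines, from = "UTF-8", to = "ASCII")
  which(is.na(ascii) | ascii != lines)
}

# Hypothetical wrapper around source(): warn before sourcing a UTF-8 file
# whose non-ASCII characters might be altered in a non-UTF-8 locale.
source_utf8_checked <- function(file, ...) {
  suspect <- non_ascii_lines(file)
  if (length(suspect) > 0L && !isTRUE(l10n_info()[["UTF-8"]])) {
    warning("non-ASCII characters on line(s) ",
            paste(suspect, collapse = ", "),
            " of ", file, " may be changed in this locale")
  }
  source(file, encoding = "UTF-8", ...)
}
```

The warning is deliberately conservative: it does not try to predict
which characters the best fit would actually change.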


In its current state, the UTF-8 encoding of R source files on
Windows is fake Unicode, since in reality it can handle only 256
different ANSI characters.
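
A possible workaround on the user's side is to write string literals
with \uxxxx escapes (the escapes mentioned in the quoted explanation
below), so that the source file is plain ASCII and does not depend on
how it is decoded. A minimal sketch: escape_non_ascii() is a
hypothetical helper, not an R function, and it handles neither NA nor
quotes and backslashes in the input.

```r
# Hypothetical helper (not part of R): rewrite one string so that every
# non-ASCII character becomes a \uxxxx (or \Uxxxxxxxx) escape.
escape_non_ascii <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))          # Unicode code points of the string
  pieces <- vapply(cps, function(cp) {
    if (cp < 128L) {
      intToUtf8(cp)                      # ASCII: keep the character as-is
    } else if (cp <= 0xFFFFL) {
      sprintf("\\u%04x", cp)             # BMP: four hex digits
    } else {
      sprintf("\\U%08x", cp)             # beyond the BMP: eight hex digits
    }
  }, character(1))
  paste(pieces, collapse = "")
}

# The input is itself written with escapes so this example stays ASCII.
cat(escape_non_ascii("p\u0159\u00edklad"), "\n")
# prints: p\u0159\u00edklad
```

This only protects the literal on its way through the parser; a later
conversion to the native encoding can still alter the string.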

Thanks,
Tomas


On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil <bor...@gmail.com> wrote:
>
> For me, this would be a perfect solution.
>
> I.e., do not use the “best fit” and leave it to the user’s competence:
> a) in some functions, UTF-8 works
> b) in others, an error is thrown (e.g., incomplete string, NA, etc.)
> => the user has to change the code, either with his/her own intentional
> “best fit” string literal substitute or by using another function that
> can handle UTF-8.
>
> Making R code work correctly only on some platforms, or trying to keep
> backward compatibility in the sense of “the code does not do what you want
> and the behaviour differs in each and every locale, but at least it does not
> throw an error”, is generally not a good idea; it is dangerous. Users and
> coders should know that something is wrong with their strings and that some
> characters are being “eaten alive”.
>
> Tomas
>
> On Thu, Apr 11, 2019 at 8:26 AM Tomas Kalibera <tomas.kalib...@gmail.com>
> wrote:
>>
>> On 4/10/19 6:32 PM, Jeroen Ooms wrote:
>> > On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.dun...@gmail.com> 
>> > wrote:
>> >> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
>> >>> Since it is "technically easy" to disable the best fit conversion and
>> >>> the best fit is rarely good, how about providing an option for
>> >>> code/package authors to disable it? I'm asking because this is one of
>> >>> the most painful issues in packages that may need to source() code
>> >>> containing UTF-8 characters that are not representable in the Windows
>> >>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
>> >>> users won't be able to knit documents or run Shiny apps correctly when
>> >>> the code contains characters that cannot be represented in the native
>> >>> encoding.
>> >> Wouldn't things be worse with it disabled than currently?  I'd expect
>> >> the line containing the "ř" to end up as NA instead of converting to "r".
>> > I don't think it would be worse, because in this case R would not
>> > implicitly convert strings to (best fit) latin1 on Windows, but
>> > instead keep the (correct) string in its UTF-8 encoding. The NA only
>> > appears if the user explicitly forces a conversion to latin1, which is
>> > not the problem here I think.
>> >
>> > The original problem that I can reproduce in RGui is that if you enter
>> >   "ř" in RGui, R opportunistically converts this to latin1, because it
>> > can. However if you enter text which can definitely not be represented
>> > in latin1, R encodes the string correctly in UTF-8 form.
>>
>> Rgui is a "Windows Unicode" application (it uses UTF-16LE), but it needs to
>> convert the input to the native encoding before passing it to R, which is
>> based on locales. However, that string is passed by R to the parser,
>> and Rgui takes advantage of this by converting non-representable characters
>> to their \uxxxx escapes, which the parser understands. Using this
>> trick, Unicode characters can get to the parser from Rgui (though they are
>> of course still at risk of conversion later, when the program runs). Rgui
>> escapes only characters that cannot be represented; unfortunately, the
>> standard C99 API for that, as implemented on Windows, does the best fit.
>> This could be fixed in Rgui by calling a special Windows API function,
>> but with the mentioned risk that it would break existing
>> uses that depend on the current behavior.
>>
>> This is the only place I know of where removing best fit would lead to
>> correct representation of UTF-8 characters. Other places will give NA
>> or some other escapes, or the code will fail to parse (e.g. "incomplete
>> string", which one can get easily with source()).
>>
>> Tomas
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

