Tomas,

> In my scenario, the conversion is invoked by RGui before returning the
> input to the main R loop, even before the input gets to the parser. In
> principle, we could change this particular conversion in RGui to avoid
> the substitution.

Not sure whether I am missing something here, but I used RStudio for my
examples (I should have said), and David mentioned RStudio as well, so it
does not seem to be a problem with RGui only.

Another example of the "best fit" behaviour seems to be "Σ" ("\u03A3",
Greek capital letter sigma, not "\u2211", n-ary summation):

print("Σ")
#> [1] "S"

Again with cp1252 on Windows 10, R 3.5.2, RStudio 1.2.1256 preview.

> even though we could rewrite in principle all calls to Windows API to
> use Unicode and have all strings in UTF-8 in R, we would still have
> problems when interfacing with packages that assume strings are in
> current native encoding (without checking), so this problem won't be
> easy to fix.

Since I regularly encounter the reverse problem, i.e. packages that
assume strings are in UTF-8 encoding without checking (which isn't very
surprising, assuming that most package developers develop on Unix/macOS
systems), I'd say "rip off the band-aid sooner rather than later".
Obviously I don't know how many bugs would surface in packages if R for
Windows' native encoding were to switch to UTF-8, but these bugs would
only be transitory, I suppose, whereas the current situation produces a
steady inflow of assume-UTF-8-encoding bugs in new packages and
functions.

Best,
Daniel

On Fri, 8 Feb 2019 at 13:07, Tomas Kalibera <tomas.kalib...@gmail.com>
wrote:

> I can reproduce this behavior on my Windows 10 system in RGui (cp1252):
> when I paste the Unicode infinity symbol into the console, it is treated
> as number 8. This is caused by Windows "best fit" default behavior in
> conversion of unicode characters to characters in the current native
> encoding: at some point in the past, 8 has been chosen as a good fit for
> infinity in Windows. In my scenario, the conversion is invoked by RGui
> before returning the input to the main R loop, even before the input
> gets to the parser.
> In principle, we could change this particular
> conversion in RGui to avoid the substitution. RGui uses "\uxxxx" escapes
> to pass characters that cannot be represented, this is why e.g. the
> Cyrillic Zhe \u0436 worked, so we could tell Windows not to do the
> substitution and pass "\u221e" for Infinity, and then the string after
> being processed by the parser will be represented in UTF-8 inside R and
> could be e.g. printed by the RGui console. That is something that could
> be considered, but it will not solve the main problem and it may
> actually cause trouble to users who are used to such substitutions
> (especially when the substitutions are more intuitive, but, that may be
> a matter of opinion).
>
> The main problem is that in normal use, sooner or later R will get to
> the point when it will need to do the conversion to native encoding, and
> in some context where "\uxxxx" escapes will not be possible. One cannot
> reliably work with strings in R that cannot be represented in the
> current native encoding (except when one knows precisely how to avoid
> the conversion in some specific task, but that may be brittle; so the
> best-fit substitution might in principle help here). This problem does
> not exist on Unix/macOS systems where the current native encoding is
> UTF-8 these days, so today it only exists on Windows where UTF-8 cannot
> be the current native encoding. As has been discussed before, even
> though we could rewrite in principle all calls to Windows API to use
> Unicode and have all strings in UTF-8 in R, we would still have problems
> when interfacing with packages that assume strings are in current native
> encoding (without checking), so this problem won't be easy to fix.
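
One way to check in practice whether a string would be damaged by conversion to a given encoding, offered only as a rough sketch: convert it to that encoding and back, then compare with the original. This flags both outright conversion failures (NA) and silent best-fit substitutions such as "∞" becoming "8". The helper name `survives_encoding` is hypothetical, and the sketch assumes the platform's iconv recognises the name "CP1252".

```r
# Hypothetical helper (not part of R): TRUE for each element of x that
# comes back unchanged after a round trip through the encoding `to`.
survives_encoding <- function(x, to = "CP1252") {
  x <- enc2utf8(x)                               # work from UTF-8
  there <- iconv(x, from = "UTF-8", to = to)     # NA if not convertible
  back <- iconv(there, from = to, to = "UTF-8")  # convert back for comparison
  # A failed conversion (NA) or a substituted character both count as FALSE.
  !is.na(back) & back == x
}

# e-acute, infinity, capital sigma, Cyrillic zhe
survives_encoding(c("\u00e9", "\u221e", "\u03a3", "\u0436"))
```

Of these four characters only the first (é) exists in cp1252, so it is the only one expected to come back TRUE, whether the platform's conversion fails on the others or substitutes a best-fit character.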
>
> Best,
> Tomas
>
> On 2/7/19 3:10 PM, Daniel Possenriede wrote:
> > There seems to be something odd with "∞" on Windows (and not only with
> > read.table)
> > In native encoding (cp-1252 in my case), "∞" gets converted to "8"
> >
> > x <- "∞"
> > Encoding(x)
> > #> [1] "unknown"
> > print(x)
> > #> [1] "8"
> > charToRaw(x)
> > #> [1] 38
> >
> > "∞" is indeed "8"
> >
> > identical(x, "8")
> > #> [1] TRUE
> >
> > Everything seems fine if "∞" is UTF-8 encoded.
> >
> > y <- "\u221E"
> > Encoding(y)
> > #> [1] "UTF-8"
> > print(y)
> > #> [1] "∞"
> > charToRaw(y)
> > #> [1] e2 88 9e
> >
> > Unless the string is converted back to native encoding.
> >
> > format(y)
> > #> [1] "8"
> >
> > This ought to be "<U+221E>", equivalently to
> >
> > format("∝")
> > #> [1] "<U+221D>"
> >
> > Session Info:
> >
> > si <- sessionInfo()
> > si$running
> > #> [1] "Windows 10 x64 (build 17134)"
> > si$R.version$version.string
> > #> [1] "R version 3.5.2 (2018-12-20)"
> > si$locale
> > #> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> >
> > On Thu, 7 Feb 2019 at 14:33, David Byrne <david.byrne...@gmail.com>
> > wrote:
> >
> >> I can confirm that it doesn't happen on Ubuntu 18.04.1 so Peter is
> >> most likely correct; it looks like it's Windows specific.
> >>
> >> On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pda...@gmail.com> wrote:
> >>> This doesn't seem to be happening on MacOS, neither in Terminal nor
> >>> RStudio, (R 3.5.1, R-devel, R-patched). So probably Windows specific.
> >>>
> >>> -pd
> >>>
> >>>> On 7 Feb 2019, at 11:17 , David Byrne <david.byrne...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Bug
> >>>> Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> >>>> file containing the infinity symbol (' ∞ ') results in the infinity
> >>>> symbol imported as the number 8.
> >>>> Other Unicode characters seem unaffected, for example Zhe: ж
> >>>>
> >>>> Expected Behavior:
> >>>> The imported data.frame should represent the infinity symbol as the
> >>>> expected 'Inf' so that normal mathematical operations can be processed
> >>>>
> >>>> Stack Overflow Post:
> >>>> I created a question on Stack Overflow where one other member was able
> >>>> to reproduce the same issues I was having. This question can be found
> >>>> at:
> >>>>
> >>>> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> >>>>
> >>>> Method to Reproduce - 1:
> >>>> A simple method to reproduce this issue is to use RStudio: In the
> >>>> console, type the following:
> >>>> > read.table(text=" ∞", encoding="UTF-8")
> >>>> The result should be a data.frame with a single value of '8'
> >>>>
> >>>> Repeating the same with ж results in the correct, expected behavior
> >>>>
> >>>> Method to Reproduce - 2:
> >>>> Create a .csv file containing the infinity and Zhe characters (I have
> >>>> attached the file for convenience, hopefully it is not rejected by your
> >>>> email service). Launch an interactive session using
> >>>>
> >>>> > r --vanilla
> >>>>
> >>>> Enter the following statement taking care to replace the
> >>>> <path-to-file> with the appropriate one:
> >>>>
> >>>> > read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> >>>>
> >>>> This should result in a two element data.frame; the first being the
> >>>> incorrect value of 8 with an additional <U+FEFF> and the second the
> >>>> correct value of Zhe.
> >>>>
> >>>> Note the additional <U+FEFF> prefixed to the front of the '8'. This
> >>>> appears to be a hidden character for the purposes of letting editors
> >>>> know the encoding. The following link has some explanation; however, it
> >>>> states this is caused by Excel. The file I created was made using
> >>>> Notepad and not Excel.
> >>>>
> >>>> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> >>>>
> >>>> System Details:
> >>>> OS:
> >>>> > Windows 10.0.17134 Build 17134
> >>>>
> >>>> R Version:
> >>>> > platform       x86_64-w64-mingw32
> >>>> > arch           x86_64
> >>>> > os             mingw32
> >>>> > system         x86_64, mingw32
> >>>> > status
> >>>> > major          3
> >>>> > minor          4.1
> >>>> > year           2017
> >>>> > month          06
> >>>> > day            30
> >>>> > svn rev        72865
> >>>> > language       R
> >>>> > version.string R version 3.4.1 (2017-06-30)
> >>>> > nickname       Single Candle
> >>>> ______________________________________________
> >>>> R-devel@r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>> --
> >>> Peter Dalgaard, Professor,
> >>> Center for Statistics, Copenhagen Business School
> >>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >>> Phone: (+45)38153501
> >>> Office: A 4.23
> >>> Email: pd....@cbs.dk Priv: pda...@gmail.com
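
On the extra <U+FEFF> in the original report: that is a byte order mark (bytes EF BB BF in UTF-8), which Notepad of that era wrote at the start of files saved as UTF-8. As a minimal sketch, and separate from the best-fit issue discussed in this thread, R's connections accept the encoding name "UTF-8-BOM" for reading and drop the mark. The file contents and temporary path below are made up for illustration and kept ASCII-only so the result does not depend on the native encoding.

```r
# Write a small file that starts with a UTF-8 byte order mark (EF BB BF),
# imitating what an editor such as Notepad may produce.
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("a,b\n1,2\n")), path)

# fileEncoding = "UTF-8-BOM" asks the connection to strip the mark,
# so the first column name is read as plain "a".
df <- read.table(path, sep = ",", header = TRUE,
                 fileEncoding = "UTF-8-BOM")
names(df)

unlink(path)  # clean up the temporary file
```

Note that `fileEncoding` (how the file is decoded on input) is a different argument from `encoding` (how the resulting strings are marked), which is the one used in the report.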