Hi Toby,

On 4/28/21 4:21 PM, Toby Hocking wrote:
> Hi Tomas, thanks for the thoughtful reply. That makes sense about the
> problems with C locale on windows. Actually I did not choose to use C
> locale, but instead it was invoked automatically during a package check.

I see. As long as the tests only have ASCII strings, the encoding does
not matter, but once there are also other characters, I think we should
be running with some real encoding, and one in which those characters
can be represented.

Best,
Tomas

> To be clear, I do NOT have a file with that name, but I do want
> file.exists to return a reasonable value, FALSE (with no error). If
> that behavior is unspecified, then should I use something like
> tryCatch(file.exists(x), error=function(e)FALSE) instead of assuming
> that file.exists will always return a logical vector without error?
> For my particular application that work-around should probably be
> sufficient, but one may imagine a situation where you want to do
>
> x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> Encoding(x) <- "unknown"
> Sys.setlocale(locale="C")
> f <- tempfile()
> cat("", file = f)
> two <- c(x, f)
> file.exists(two)
>
> and in that case the correct response from R, in my opinion, would be
> c(FALSE, TRUE) -- not an error.
> Toby
>
> On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera
> <tomas.kalib...@gmail.com> wrote:
>
> > Hi Toby,
> >
> > a defensive, portable approach would be to use only file names regarded
> > portable by POSIX, so characters including ASCII letters, digits,
> > underscore, dot, hyphen (but hyphen should not be the first character).
> > That would always work on all systems and this is what I would use.
> >
> > Individual operating systems and file systems and their configurations
> > differ in which additional characters they support and how. On some,
> > file names are just sequences of bytes, on some, they have to be valid
> > strings in certain encoding (and then with certain exceptions).
> >
> > On Windows, file names are at the lowest level in UTF-16LE encoding (and
> > admitting unpaired surrogates for historical reasons). R stores strings
> > in other encodings (UTF-8, native, Latin-1), so file names have to be
> > translated to/from UTF-16LE, either directly by R or by Windows.
> >
> > But, there is no way to convert (non-ASCII) strings in "C" encoding to
> > UTF-16LE, so the examples cannot be made to work on Windows.
> >
> > When the translation is left to Windows, it assumes the non-UTF-16LE
> > strings are in the Active Code Page encoding (shown as "system encoding"
> > in sessionInfo() in R, Latin-1 in your example) instead of the current C
> > library encoding ("C" in your example). So, file names coming from
> > Windows will be either the bytes of their UTF-16LE representation or the
> > bytes of their Latin-1 representation, but which one is subject to the
> > implementation details, so the result is really unusable.
> >
> > I would say using "C" as encoding in R is not a good idea, and
> > particularly not on Windows.
> >
> > I would say that what happens with such file names in "C" encoding is
> > unspecified behavior, which is subject to change at any time without
> > notice, and that both the R 4.0.5 and R-devel behavior you are observing
> > are acceptable. I don't think it should be mentioned in the NEWS.
> > Personally, I would prefer some stricter checks of strings validity and
> > perhaps disallowing the "C" encoding in R, so yet another behavior where
> > it would be clearer that this cannot really work, but that would require
> > more thought and effort.
> >
> > Best
> > Tomas
> >
> >
> > On 4/27/21 9:53 PM, Toby Hocking wrote:
> >
> > > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
> > > R-devel already. I checked on
> > > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
> > > mention of these changes, so I'm wondering if they are intentional?
> > > If so, could someone please add a mention of the bugfix in the NEWS?
> > >
> > > The problem involves file.exists, on windows, when a long/strange input
> > > file name Encoding is unknown, in C locale. I expected that FALSE should be
> > > returned (and it is on R-devel), but I got an error in R-4.0.5. Code to
> > > reproduce is:
> > >
> > > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> > > Encoding(x) <- "unknown"
> > > Sys.setlocale(locale="C")
> > > sessionInfo()
> > > file.exists(x)
> > >
> > > Output I got from R-4.0.5 was
> > >
> > > > sessionInfo()
> > > R version 4.0.5 (2021-03-31)
> > > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > > Running under: Windows 10 x64 (build 19042)
> > >
> > > Matrix products: default
> > >
> > > locale:
> > > [1] C
> > > system code page: 1252
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.0.5
> > > > file.exists(x)
> > > Error in file.exists(x) : file name conversion problem -- name too long?
> > > Execution halted
> > >
> > > Output I got from R-devel was
> > >
> > > > sessionInfo()
> > > R Under development (unstable) (2021-04-26 r80229)
> > > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > > Running under: Windows 10 x64 (build 19042)
> > >
> > > Matrix products: default
> > >
> > > locale:
> > > [1] C
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > > loaded via a namespace (and not attached):
> > > [1] compiler_4.2.0
> > > > file.exists(x)
> > > [1] FALSE
> > >
> > > I also observed similar results when using normalizePath instead of
> > > file.exists (error in R-4.0.5, no error in R-devel).
> > >
> > > > normalizePath(x) #R-4.0.5
> > > Error in path.expand(path) : unable to translate 'p'
> > > | p'p;
> > > | p'p<
> > > | p'p=
> > > | p'p>
> > > | p'p<bf>
> > > ' to UTF-8
> > > Calls: normalizePath -> path.expand
> > > Execution halted
> > >
> > > > normalizePath(x) #R-devel
> > > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> > > Warning message:
> > > In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒
> > > | 🧒🏻
> > > | 🧒🏼
> > > | 🧒🏽
> > > | 🧒🏾
> > > | 🧒🏿
> > > ": The filename, directory name, or volume label syntax is incorrect
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
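
For reference, the two defensive approaches raised in this thread (wrapping file.exists in tryCatch, and restricting file names to the characters POSIX regards as portable) can be sketched in R as follows. This is only a sketch under stated assumptions: safe_file_exists and is_portable_name are hypothetical helper names, not base R functions, and applying tryCatch per element is one possible way to get c(FALSE, TRUE) for a mixed vector rather than a single error for the whole call.

```r
# Hypothetical helper (not part of base R): like file.exists(), but any
# element that makes file.exists() signal an error is reported as FALSE.
safe_file_exists <- function(paths) {
  vapply(paths, function(p) {
    tryCatch(isTRUE(file.exists(p)), error = function(e) FALSE)
  }, logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: TRUE when a name uses only ASCII letters, digits,
# underscore, dot and hyphen, and does not start with a hyphen.
# useBytes = TRUE matches byte-wise, so strings that are invalid in the
# current encoding simply fail the test.
is_portable_name <- function(names) {
  grepl("^[A-Za-z0-9._][A-Za-z0-9._-]*$", names, perl = TRUE, useBytes = TRUE)
}

f <- tempfile()
cat("", file = f)  # create an empty file so that f exists

safe_file_exists(c(f, paste0(f, "-missing")))
is_portable_name(c("report_2021.txt", "-bad", "caf\303\251.txt"))
```

Checking names with is_portable_name before they reach the file system avoids depending on the unspecified "C" locale behavior at all; safe_file_exists only hides the error after the fact.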