Hi Toby,

A defensive, portable approach is to use only file names that POSIX regards as portable: names made up solely of ASCII letters, digits, underscore, dot, and hyphen, with the hyphen never as the first character. Such names work on all systems, and this is what I would use.
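As an illustration of that rule, here is a minimal sketch of a check for a single file name component (not a full path). The helper name is_portable_filename is made up for this example and is not part of R; the pattern is simply my reading of the character set described above.

```r
# Hypothetical helper (not part of base R): TRUE when a single file name
# component uses only ASCII letters, digits, underscore, dot and hyphen,
# and does not start with a hyphen.
is_portable_filename <- function(name) {
  # useBytes = TRUE matches on raw bytes, so non-ASCII or invalid input
  # simply fails to match instead of raising an encoding error.
  # \z anchors at the very end, so a trailing newline is rejected too.
  grepl("^[A-Za-z0-9._][A-Za-z0-9._-]*\\z", name, perl = TRUE, useBytes = TRUE)
}

is_portable_filename(c("report_2021-04.txt", "-rf", "na\303\257ve.txt"))
```

The call at the end should flag only the first name as portable; the second starts with a hyphen and the third contains non-ASCII bytes.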

Operating systems, file systems, and their configurations differ in which additional characters they support and how. On some, file names are just sequences of bytes; on others, they have to be valid strings in a certain encoding (and then with certain exceptions).

On Windows, file names are, at the lowest level, encoded in UTF-16LE (with unpaired surrogates admitted for historical reasons). R stores strings in other encodings (UTF-8, native, Latin-1), so file names have to be translated to/from UTF-16LE, either directly by R or by Windows.
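To make the translation concrete, the following sketch shows the same character as UTF-8 bytes and as UTF-16LE bytes. It uses the first emoji from the reported file name and assumes the platform's iconv supports the "UTF-16LE" name, which is common but not guaranteed everywhere. It is only an illustration of the two representations, not of what R does internally for file names.

```r
# One emoji (U+1F9D2), written as UTF-8 bytes and declared as UTF-8
x <- "\360\237\247\222"
Encoding(x) <- "UTF-8"

utf8_bytes <- charToRaw(x)
# toRaw = TRUE returns a list with one raw vector per input string
utf16_bytes <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

print(utf8_bytes)   # f0 9f a7 92
print(utf16_bytes)  # 3e d8 d2 dd: surrogate pair D83E DDD2, low byte first
```

The conversion is only possible because the source encoding is declared; without knowing how to interpret the input bytes there is nothing to convert from.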

However, there is no way to convert non-ASCII strings in the "C" encoding to UTF-16LE, so the examples cannot be made to work on Windows.

When the translation is left to Windows, it assumes that strings not in UTF-16LE are in the Active Code Page encoding (shown as "system code page" in sessionInfo() in R, Latin-1 in your example) rather than in the current C library encoding ("C" in your example). So file names coming from Windows will be either the bytes of their UTF-16LE representation or the bytes of their Latin-1 representation; which of the two depends on implementation details, so the result is effectively unusable.

I would say using "C" as encoding in R is not a good idea, and particularly not on Windows.
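A script that handles non-ASCII file names could guard against this situation up front. The sketch below is one possible way to do that; is_c_ctype is a made-up helper name, and treating "POSIX" the same as "C" is my assumption.

```r
# Hypothetical helper: TRUE when the character-type locale is "C" or "POSIX"
is_c_ctype <- function(ctype = Sys.getlocale("LC_CTYPE")) {
  ctype %in% c("C", "POSIX")
}

# Warn early instead of failing later inside file.exists() or normalizePath()
if (is_c_ctype()) {
  warning("LC_CTYPE is \"C\": non-ASCII file names may not work")
}
```

Taking the locale name as an argument keeps the check easy to try out without changing the session locale.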

I would say that what happens with such file names in "C" encoding is unspecified behavior, which is subject to change at any time without notice, and that both the R 4.0.5 and the R-devel behavior you are observing are acceptable. I don't think it should be mentioned in the NEWS. Personally, I would prefer stricter checks of string validity, and perhaps disallowing the "C" encoding in R altogether: yet another behavior, but one that would make it clearer that this cannot really work. That would require more thought and effort, though.

Best
Tomas


On 4/27/21 9:53 PM, Toby Hocking wrote:

Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
R-devel already. I checked on
https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
mention of these changes, so I'm wondering if they are intentional? If so,
could someone please add a mention of the bugfix in the NEWS?

The problem involves file.exists, on windows, when a long/strange input
file name Encoding is unknown, in C locale. I expected that FALSE should be
returned (and it is on R-devel), but I got an error in R-4.0.5. Code to
reproduce is:

x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
\360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
\360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
Encoding(x) <- "unknown"
Sys.setlocale(locale="C")
sessionInfo()
file.exists(x)

Output I got from R-4.0.5 was

sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] C
system code page: 1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.5
file.exists(x)
Error in file.exists(x) : file name conversion problem -- name too long?
Execution halted

Output I got from R-devel was

sessionInfo()
R Under development (unstable) (2021-04-26 r80229)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.2.0
file.exists(x)
[1] FALSE

I also observed similar results when using normalizePath instead of
file.exists (error in R-4.0.5, no error in R-devel).

normalizePath(x) #R-4.0.5
Error in path.expand(path) : unable to translate 'p'
| p'p;
| p'p<
| p'p=
| p'p>
| p'p<bf>
' to UTF-8
Calls: normalizePath -> path.expand
Execution halted

normalizePath(x) #R-devel
[1] "C:\\Users\\th798\\R\\\360\237\247\222\n|
\360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n|
\360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n|
\360\237\247\222\360\237\217\277\n"
Warning message:
In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒
| 🧒🏻
| 🧒🏼
| 🧒🏽
| 🧒🏾
| 🧒🏿
": The filename, directory name, or volume label syntax is incorrect


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
