Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"

Tomas Kalibera Tue, 30 Jun 2020 00:25:57 -0700

On 6/29/20 4:39 PM, Johannes Rauh wrote:

Dear R Developers,


I noticed that `basename` and `dirname` always return "UTF-8" on Windows 
(tested with R-4.0.0 and R-3.6.3):

p <- "Föö/Bär"
Encoding(p)

[1] "latin1"

Encoding(dirname(p))

[1] "UTF-8"

Encoding(basename(p))

[1] "UTF-8"

Is this on purpose?  At least I did not find any relevant comment in the 
documentation of `dirname`/`basename`.
Background: I'm currently struggling with a directory name containing a 
latin1-character.  (I know that this is a bad idea, but I did not create the directory 
and I cannot rename it.)  I now want to pass a latin1-directory name to a function, which 
internally uses `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called, 
which changes the encoding, and things break.  If I use `debug` to halt the processing 
and "fix" the encoding, things work as expected.

So, if possible, I would prefer that `dirname` and `basename` preserve the 
encoding.

Please try to always submit a minimal reproducible example with your reports and test with at least the latest released version of R, ideally also with R-devel.
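A minimal sketch of what such a reproducible example could look like for this report. It builds the path from explicit latin1 bytes, so the result does not depend on how the script or the mail was encoded; the byte escapes and the extra diagnostic lines are assumptions added for illustration, not part of the original report.

```r
# Build the path from explicit latin1 bytes (f6 = o-umlaut, e4 = a-umlaut)
# and declare the encoding, instead of relying on the source file encoding.
p <- "F\xf6\xf6/B\xe4r"
Encoding(p) <- "latin1"

# Declared encoding before and after the calls in question.
print(Encoding(p))
print(Encoding(dirname(p)))
print(Encoding(basename(p)))

# Context a reader needs to reproduce the result: R version and locale.
print(R.version.string)
print(l10n_info())
```

The declared encodings printed for `dirname(p)` and `basename(p)` depend on the platform, the locale, and the R version, which is why the version and locale are printed alongside them.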

As you have not sent a reproducible example, it is hard to tell for sure, but most likely, as Kevin wrote, you have run into a real bug, which was however already fixed in 4.0.2 and in R-devel (17833). The lazy loading cache did not work with file names in non-native encoding.

That real bug has been uncovered by legitimate and correct changes like the ones you report, where file operations started returning non-ASCII strings in UTF-8. Historically in R such functions would instead return native strings with misrepresented characters, and we were reluctant to change that, expecting it to wake up bugs in code silently assuming native encoding. Still, as people were increasingly running into problems with non-representable characters, we made that change in several functions anyway, and yes, it started waking up bugs.

We could preferentially return results in native encoding, and in UTF-8 only when they include non-representable characters. That would increase code complexity and performance overhead, but would wake up existing bugs with smaller probability. Note: some code that previously relied on best-fit conversions done by Windows will have been broken anyway. We would have to bypass win_iconv/iconv for that (adding more complexity). Bugs in code not handling encodings properly would still be triggered via non-representable characters. I've recently changed file.path() in R-devel to be slightly more conservative again, along these lines.
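The "native when representable, UTF-8 otherwise" idea can be sketched at the R level as follows. This is only an illustration under stated assumptions: `prefer_native` is a hypothetical helper, not a function in R, and the actual change would live in R's C code. The round-trip comparison is an assumed way to reject best-fit conversions that silently substitute characters.

```r
# Hypothetical helper (not part of R): return each element in the native
# encoding when it survives a round trip unchanged, otherwise in UTF-8.
prefer_native <- function(x) {
  utf8 <- enc2utf8(x)
  # to = "" means the native encoding; unconvertible elements become NA.
  native <- iconv(utf8, from = "UTF-8", to = "")
  # Convert back to detect lossy conversions that did not fail outright.
  back <- iconv(native, from = "", to = "UTF-8")
  representable <- !is.na(native) & !is.na(back) & back == utf8
  out <- utf8
  out[representable] <- native[representable]
  out
}

p <- "F\xf6\xf6/B\xe4r"
Encoding(p) <- "latin1"
print(Encoding(prefer_native(p)))  # depends on the session's native encoding
```

The extra conversion and comparison for every string give a rough sense of the performance overhead and complexity mentioned above.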

We can still do it more widely, but it is not high on the priority list. The way to fix all of these problems is switching to UTF-8 as native encoding on Windows, and every day spent on tuning the existing behavior postpones that real solution.


Best
Tomas


Best regards
Johannes

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
