Re: [R-SIG-Mac] accented vowels

Duncan Murdoch Mon, 15 Aug 2011 19:25:42 -0700

On 11-08-15 7:48 PM, Denis Chabot wrote:


On 2011-08-15 at 19:06, Duncan Murdoch wrote:

On 11-08-15 2:42 PM, Denis Chabot wrote:

Hi,

I usually do not give a second thought to accented vowels, and R handles
everything fine thanks to UTF-8 being used in my R scripts. But today I have a
problem: accented vowels do not behave properly when they are imported into R
using list.files.

Maybe this is because OS X (I'm using 10.6.8) still uses MacRoman for file
names, though visually the names seem to have been read correctly into R.

An example is better than words:

sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base


This does not cause a problem:
a = c("1_MO2 crevettes po2crit.Rda", "1_MO2 soles Sète sda.Rda",
      "1_MO2 turbots po2crit.Rda"); a
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"    "1_MO2 turbots po2crit.Rda"

a2 = gsub(" Sète", "S", a); a2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda"        "1_MO2 turbots po2crit.Rda"


but if instead of creating the vector within the R script, I read it as a 
series of file names, the substitution does not work. I am sorry that I cannot 
make this a reproducible example as it requires the 3 files to exist on your 
computer, but you could create 3 dummy files having the same names in the 
directory of your choice.

don = file.path("données/")
b = list.files(path = don, pattern = "1_MO2"); b
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"

b2 = gsub(" Sète", "S",  b); b2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"

I am puzzled and also "stuck". For now I'll modify the file name, but I need to 
be able to handle such names at some point.

Any advice?



Possibly your system really is using MacRoman or some other local encoding;
in that case, iconv(x, "", "UTF-8") should convert from the local encoding to UTF-8.

I think declaring everything to be UTF-8 may be sufficient.  When I use list.files(), I
see the encoding listed as "unknown", but

x<- list.files()
Encoding(x)<- "UTF-8"

works.  However, the iconv() method should be safer.
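
To illustrate the difference between the two suggestions: assigning to Encoding() only changes the declared encoding mark and leaves the bytes alone, whereas iconv() rewrites the bytes. A minimal sketch, assuming a hand-built Latin-1 string (the variable names and the input are made up for illustration, not taken from the files in this thread):

```r
# "S\xe8te" holds the single Latin-1 byte e8 for the accented "e" (hypothetical input).
x <- "S\xe8te"
Encoding(x) <- "latin1"   # declares what the bytes mean; the bytes are unchanged
charToRaw(x)              # 53 e8 74 65

# iconv() actually re-encodes: the one-byte accented "e" becomes the two bytes c3 a8.
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)              # 53 c3 a8 74 65
```

If the bytes are already valid UTF-8, declaring the encoding is enough; if they are in some other encoding, only the conversion gives correct UTF-8.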

Duncan Murdoch


Hi Duncan,

iconv() confirmed what I suspected: there was no problem with the encoding of the
result of list.files; if there had been one, the "è" would not have looked like an "è".
As a result, I got nonsense when treating this "è" as MacRoman to be converted into UTF-8:

iconv(b, from="MacRoman", to="UTF-8")
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles SeÃÄte sda.Rda"  "1_MO2 turbots po2crit.Rda"

It is not clear, however, that R considered b to be UTF-8:
Encoding(b)
[1] "unknown" "unknown" "unknown"

so I followed your suggestion:

Encoding(b)<- "UTF-8"
Encoding(b)
[1] "unknown" "UTF-8"   "unknown"

but gsub still did not work:
b2 = gsub(" Sète", "S",  b); b2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"

I do not know why gsub worked with example "a" but not "b" in the example shown 
in my original message. Strange and frustrating.

Unicode sometimes gives different ways to encode what is rendered as the same character (e.g. letter + accent versus accented letter). I think (see below) the OS uses one convention, but R chooses the other when it parses your text.

Cut and paste did just work for me, in a version of R 2.13.0 Patched which predates 2.13.1 by a few weeks; I'm not up to date on my Mac:



> x <- list.files()
> x
[1] "1_MO2 soles Sète sda.Rda"
> gsub("Sète", "XXXX", x)
[1] "1_MO2 soles XXXX sda.Rda"

In the second line, I didn't try to type the pattern containing Sète, I just cut and pasted it from the printed version of x.

One other possibility (and perhaps it's the best one, if your substitutions are all so simple) is to use the useBytes=TRUE option to gsub. You can use charToRaw to see the bytes in a string, to make sure they are what you expect.

When I do that, I see that the è really is handled differently in the two cases:


> charToRaw("Sète") # cut and paste from list.files() output
[1] 53 65 cc 80 74 65
> charToRaw("Sète") # entered on the keyboard
[1] 53 c3 a8 74 65
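
The comparison above can be reproduced without any files on disk by writing both forms with \u escapes, so the bytes do not depend on how the text was typed or pasted. This is only a sketch: the names typed, pasted, miss and hit are made up for illustration, and the file name is rebuilt by hand rather than read with list.files().

```r
# Two strings that both display as "Sète".
typed  <- "S\u00e8te"    # precomposed: U+00E8, as entered on the keyboard
pasted <- "Se\u0300te"   # decomposed: "e" + U+0300 combining grave, as in the file name

charToRaw(typed)         # 53 c3 a8 74 65
charToRaw(pasted)        # 53 65 cc 80 74 65

# A file name in the decomposed form that list.files() returned in this thread.
b <- paste("1_MO2 soles ", pasted, " sda.Rda", sep = "")

# The typed (precomposed) pattern finds nothing, so b comes back unchanged.
miss <- gsub(paste(" ", typed, sep = ""), "S", b, fixed = TRUE)

# A pattern with exactly the same bytes as the file name does match,
# here compared byte by byte with useBytes = TRUE.
hit <- gsub(paste(" ", pasted, sep = ""), "S", b, fixed = TRUE, useBytes = TRUE)
hit                      # "1_MO2 solesS sda.Rda"
```

Note that useBytes = TRUE does not by itself reconcile the two forms; the pattern still has to contain the same bytes as the text being searched.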

So your solution is ugly: you'll need to code all your substitutions twice (or more!) to handle all the possible ways the same letter could be encoded. Or maybe iconv() or some other function has an option to normalize the encoding. (I've just read some more about the issue in http://en.wikipedia.org/wiki/Unicode_equivalence; normalization is what you want to do, but I don't know how to do it.)
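
One way to avoid coding every substitution twice is to bring the file names into the precomposed form once, before matching. The sketch below is a hand-rolled stand-in, not real Unicode normalization: compose_accents() and its small lookup table are hypothetical and cover only a few French vowels. For full NFC normalization, the stringi package provides stri_trans_nfc(), if that package is available.

```r
# Replace a few "letter + combining accent" pairs by the precomposed letter.
# NOT a complete NFC implementation; the table is an illustrative assumption.
compose_accents <- function(x) {
  decomposed  <- c("e\u0300", "e\u0301", "e\u0302", "a\u0300", "u\u0300")
  precomposed <- c("\u00e8",  "\u00e9",  "\u00ea",  "\u00e0",  "\u00f9")
  for (i in seq_along(decomposed)) {
    x <- gsub(decomposed[i], precomposed[i], x, fixed = TRUE)
  }
  x
}

# Decomposed file name, in the form list.files() returned in this thread.
b <- "1_MO2 soles Se\u0300te sda.Rda"
b_composed <- compose_accents(b)

# The pattern as typed on the keyboard (precomposed) now matches.
b2 <- gsub(" S\u00e8te", "S", b_composed, fixed = TRUE)
b2                       # "1_MO2 solesS sda.Rda"
```

After this step a single pattern per substitution is enough, as long as every accented letter in the file names is covered by the table.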


Duncan Murdoch

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
