On 2011-08-15 at 22:24, Duncan Murdoch wrote:

> On 11-08-15 7:48 PM, Denis Chabot wrote:
>>
>> On 2011-08-15 at 19:06, Duncan Murdoch wrote:
>>
>>> On 11-08-15 2:42 PM, Denis Chabot wrote:
>>>> Hi,
>>>>
>>>> I usually do not give a second thought to accented vowels, and R handles
>>>> everything fine thanks to UTF-8 being used in my R scripts. But today I
>>>> have a problem. Accented vowels do not behave properly when they are
>>>> imported into R using list.files.
>>>>
>>>> Maybe this is because OS X (I'm using 10.6.8) still uses MacRoman for
>>>> file names, though visually the names seem to have been read correctly
>>>> into R.
>>>>
>>>> An example is better than words:
>>>>
>>>> sessionInfo()
>>>> R version 2.13.1 (2011-07-08)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>
>>>> locale:
>>>> [1] fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>>
>>>> This does not cause a problem:
>>>> a = c("1_MO2 crevettes po2crit.Rda", "1_MO2 soles Sète sda.Rda", "1_MO2 turbots po2crit.Rda"); a
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
>>>>
>>>> a2 = gsub(" Sète", "S", a); a2
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda" "1_MO2 turbots po2crit.Rda"
>>>>
>>>>
>>>> But if, instead of creating the vector within the R script, I read it as a
>>>> series of file names, the substitution does not work. I am sorry that I
>>>> cannot make this a reproducible example, as it requires the 3 files to
>>>> exist on your computer, but you could create 3 dummy files having the same
>>>> names in the directory of your choice.
>>>>
>>>> don = file.path("données/")
>>>> b = list.files(path = don, pattern = "1_MO2"); b
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
>>>>
>>>> b2 = gsub(" Sète", "S", b); b2
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
>>>>
>>>> I am puzzled and also "stuck". For now I'll modify the file name, but I
>>>> need to be able to handle such names at some point.
>>>>
>>>> Any advice?
>>>
>>>
>>> Possibly your system really is using MacRoman or some other local encoding;
>>> in that case, iconv(x, "", "UTF-8") should convert from the local encoding
>>> to UTF-8.
>>>
>>> I think declaring everything to be UTF-8 may be sufficient. When I use
>>> list.files(), I see the encoding listed as "unknown", but
>>>
>>> x <- list.files()
>>> Encoding(x) <- "UTF-8"
>>>
>>> works. However, the iconv() method should be safer.
>>>
>>> Duncan Murdoch
>>
>> Hi Duncan,
>>
>> iconv() confirmed what I suspected: there was no problem with the encoding
>> of the result of list.files, and if there had been one, the "è" would not
>> have looked like a "è". Therefore, I got nonsense when treating this "è" as
>> MacRoman to be converted into UTF-8:
>>
>> iconv(b, from="MacRoman", to="UTF-8")
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles SeÃÄte sda.Rda" "1_MO2 turbots po2crit.Rda"
>>
>> It is not clear, however, that R considered b to be UTF-8:
>> Encoding(b)
>> [1] "unknown" "unknown" "unknown"
>>
>> so I followed your suggestion:
>>
>> Encoding(b) <- "UTF-8"
>> Encoding(b)
>> [1] "unknown" "UTF-8" "unknown"
>>
>> but gsub still did not work:
>> b2 = gsub(" Sète", "S", b); b2
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
>>
>> I do not know why gsub worked with example "a" but not "b" in the example
>> shown in my original message. Strange and frustrating.
>
> Unicode sometimes gives different ways to encode what is rendered as the same
> character (e.g. letter + accent versus accented letter). I think (see below)
> the OS uses one convention, but R chooses the other when it parses your text.
>
> Cut and paste did just work for me, in a version of R 2.13.0 Patched which
> predates 2.13.1 by a few weeks; I'm not up to date on my Mac:
>
> > x <- list.files()
> > x
> [1] "1_MO2 soles Sète sda.Rda"
> > gsub("Sète", "XXXX", x)
> [1] "1_MO2 soles XXXX sda.Rda"
>
> In the second line, I didn't try to type the pattern containing Sète, I just
> cut and pasted it from the printed version of x.
>
> One other possibility (and perhaps it's the best one, if your substitutions
> are all so simple) is to use the useBytes=TRUE option to gsub. You can use
> charToRaw to see the bytes in a string, to make sure they are what you expect.
>
> When I do that, I see that the è really is handled differently in the two
> cases:
>
> > charToRaw("Sète") # cut and paste from list.files() output
> [1] 53 65 cc 80 74 65
> > charToRaw("Sète") # entered on the keyboard
> [1] 53 c3 a8 74 65
>
> So your solution is ugly: you'll need to code all your substitutions twice
> (or more!) to handle all the possible ways the same letter could be encoded.
> Or maybe iconv() or some other function has an option to normalize the
> encoding. (I've just read some more about the issue in
> http://en.wikipedia.org/wiki/Unicode_equivalence; normalization is what you
> want to do, but I don't know how to do it.)
>
> Duncan Murdoch
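
The two byte dumps in Duncan's message can be reproduced without any files on
disk by writing both spellings with \u escapes. The sketch below also shows one
possible way to normalize to the precomposed (NFC) form before calling gsub.
It assumes the stringi package is installed; stringi is not mentioned anywhere
in this thread, so treat that part as one option rather than the recommended
fix. The variable names are illustrative only.

```r
# Two visually identical spellings of "Sète", written with \u escapes so the
# bytes do not depend on how this script was typed or saved.
composed   <- "S\u00e8te"    # precomposed U+00E8 (what the keyboard produced)
decomposed <- "Se\u0300te"   # "e" + combining grave U+0300 (file-name form)

charToRaw(composed)               # 53 c3 a8 74 65
charToRaw(decomposed)             # 53 65 cc 80 74 65
identical(composed, decomposed)   # FALSE: same glyphs, different bytes

# Assumption: stringi is available. stri_trans_nfc() converts strings to
# Unicode Normalization Form C, i.e. the precomposed spelling.
if (requireNamespace("stringi", quietly = TRUE)) {
  b <- c("1_MO2 crevettes po2crit.Rda",
         "1_MO2 soles Se\u0300te sda.Rda",   # as list.files() returned it
         "1_MO2 turbots po2crit.Rda")
  b_nfc <- stringi::stri_trans_nfc(b)
  b2 <- gsub(" S\u00e8te", "S", b_nfc, fixed = TRUE)
  print(b2)
}
```

Normalizing once, right after list.files(), means every later pattern can be
written in the keyboard form only.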
Hi again Duncan,

The "Errors due to normalization differences" part of the article you
referred to seems to confirm your suspicion.

I can get this to work, but it is messy:

# "Sète" as returned by list.files(): 5 code points (e + combining accent)
Sètefileraw = charToRaw(substr(b[2],13,17))
Sètefile = rawToChar(Sètefileraw)
# "Sète" as typed on the keyboard: 4 code points (precomposed è)
Sètekbraw = charToRaw(substr(a[2],13,16))
Sètekb = rawToChar(Sètekbraw)
c = b
c = gsub(Sètefile, Sètekb, c)

At this point, Sète has become the "keyboard" version and the rest of the
script can work:

c2 = gsub(" Sète", "S", c); c2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda" "1_MO2 turbots po2crit.Rda"

I'll keep accented vowels out of file names for this project whenever I have
to use gsub on them!

Thanks again,

Denis

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
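
The workaround in this message depends on character positions inside one
particular file name. A slightly more general base-R sketch of the same idea
is to replace each "letter + combining accent" pair by its precomposed letter.
The helper name compose_accents and the short lookup table are hypothetical
(they are not from the thread), and the table covers only a few French
letters, so this is a partial substitute for real Unicode normalization.

```r
# Hypothetical helper: rewrite decomposed accented letters as precomposed
# ones using fixed-string substitution only (no extra packages).
compose_accents <- function(x) {
  # Parallel vectors: decomposed spelling, then its precomposed equivalent.
  # Extend both as needed; they must stay the same length.
  from <- c("e\u0300", "e\u0301", "e\u0302", "a\u0300",
            "u\u0300", "o\u0302", "c\u0327")
  to   <- c("\u00e8",  "\u00e9",  "\u00ea",  "\u00e0",
            "\u00f9",  "\u00f4",  "\u00e7")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# File names as list.files() returned them (decomposed e + U+0300)
files <- c("1_MO2 crevettes po2crit.Rda",
           "1_MO2 soles Se\u0300te sda.Rda",
           "1_MO2 turbots po2crit.Rda")

fixed_names <- compose_accents(files)
c2 <- gsub(" S\u00e8te", "S", fixed_names, fixed = TRUE)
print(c2)
```

Names without accents pass through unchanged, so the helper can be applied to
the whole vector returned by list.files().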