Re: [R-SIG-Mac] accented vowels

Denis Chabot Tue, 16 Aug 2011 03:35:10 -0700

On 2011-08-15 at 22:24, Duncan Murdoch wrote:

> On 11-08-15 7:48 PM, Denis Chabot wrote:
>> 
>> On 2011-08-15 at 19:06, Duncan Murdoch wrote:
>> 
>>> On 11-08-15 2:42 PM, Denis Chabot wrote:
>>>> Hi,
>>>> 
>>>> I usually do not give a second thought to accented vowels, and R handles 
>>>> everything fine thanks to UTF-8 being used in my R scripts. But today I 
>>>> have a problem. Accented vowels do not behave properly when they were 
>>>> imported into R using list.files.
>>>> 
>>>> Maybe this is because OS X (I'm using 10.6.8) still uses MacRoman for 
>>>> file names, though visually the names seem to have been read correctly 
>>>> into R.
>>>> 
>>>> An example is better than words:
>>>> 
>>>> sessionInfo()
>>>> R version 2.13.1 (2011-07-08)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>> 
>>>> locale:
>>>> [1] fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
>>>> 
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> 
>>>> 
>>>> This does not cause a problem:
>>>> a = c("1_MO2 crevettes po2crit.Rda", "1_MO2 soles Sète sda.Rda",
>>>>       "1_MO2 turbots po2crit.Rda"); a
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"    "1_MO2 turbots po2crit.Rda"
>>>> 
>>>> a2 = gsub(" Sète", "S", a); a2
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda"        "1_MO2 turbots po2crit.Rda"
>>>> 
>>>> 
>>>> but if instead of creating the vector within the R script, I read it as a 
>>>> series of file names, the substitution does not work. I am sorry that I 
>>>> cannot make this a reproducible example as it requires the 3 files to 
>>>> exist on your computer, but you could create 3 dummy files having the same 
>>>> names in the directory of your choice.
>>>> 
>>>> don = file.path("données/")
>>>> b = list.files(path = don, pattern = "1_MO2"); b
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"
>>>> 
>>>> b2 = gsub(" Sète", "S",  b); b2
>>>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"
>>>> 
>>>> I am puzzled and also "stuck". For now I'll modify the file name, but I 
>>>> need to be able to handle such names at some point.
>>>> 
>>>> Any advice?
>>> 
>>> 
>>> Possibly your system really is using MacRoman or some other local encoding; 
>>> in that case, iconv(x, "", "UTF-8") should convert from the local encoding 
>>> to UTF-8.
>>> 
>>> I think declaring everything to be UTF8 may be sufficient.  When I use 
>>> list.files(), I see the encoding listed as "unknown", but
>>> 
>>> x <- list.files()
>>> Encoding(x) <- "UTF-8"
>>> 
>>> works.  However, the iconv() method should be safer.
>>> 
>>> Duncan Murdoch
>> 
>> Hi Duncan,
>> 
>> iconv() confirmed what I suspected: there was no problem with the encoding 
>> of the result of list.files, and if there had been one, the "è" would not 
>> have looked like a "è". Therefore, I got nonsense when treating this "è" as 
>> MacRoman to be converted into UTF-8:
>> 
>> iconv(b, from="MacRoman", to="UTF-8")
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles SeÃÄte sda.Rda"  "1_MO2 turbots po2crit.Rda"
>> 
>> It is not clear, however, that R considered b to be UTF-8:
>> Encoding(b)
>> [1] "unknown" "unknown" "unknown"
>> 
>> so I followed your suggestion:
>> 
>> Encoding(b) <- "UTF-8"
>> Encoding(b)
>> [1] "unknown" "UTF-8"   "unknown"
>> 
>> but gsub still did not work:
>> b2 = gsub(" Sète", "S",  b); b2
>> [1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda"     "1_MO2 turbots po2crit.Rda"
>> 
>> I do not know why gsub worked with example "a" but not "b" in the example 
>> shown in my original message. Strange and frustrating.
> 
> Unicode sometimes gives different ways to encode what is rendered as the same 
> character (e.g. letter + accent versus accented letter).  I think (see below) 
> the OS uses one convention, but R chooses the other when it parses your text.
> 
> Cut and paste did just work for me, in a version of R 2.13.0 Patched which 
> predates 2.13.1 by a few weeks; I'm not up to date on my Mac:
> 
> 
> > x <- list.files()
> > x
> [1] "1_MO2 soles Sète sda.Rda"
> > gsub("Sète", "XXXX", x)
> [1] "1_MO2 soles XXXX sda.Rda"
> 
> 
> 
> In the second line, I didn't try to type the pattern containing Sète, I just 
> cut and pasted it from the printed version of x.
> 
> One other possibility (and perhaps it's the best one, if your substitutions 
> are all so simple) is to use the useBytes=TRUE option to gsub.  You can use 
> charToRaw to see the bytes in a string, to make sure they are what you expect.
> 
> When I do that, I see that the è really is handled differently in the two 
> cases:
> 
> > charToRaw("Sète") # cut and paste from list.files() output
> [1] 53 65 cc 80 74 65
> > charToRaw("Sète") # entered on the keyboard
> [1] 53 c3 a8 74 65
> 
> So your solution is ugly:  you'll need to code all your substitutions twice 
> (or more!) to handle all the possible ways the same letter could be encoded.  
> Or maybe iconv() or some other function has an option to normalize the 
> encoding.  (I've just read some more about the issue in 
> http://en.wikipedia.org/wiki/Unicode_equivalence; normalization is what you 
> want to do, but I don't know how to do it.)
> 
> Duncan Murdoch


Hi again Duncan,

The "Errors due to normalization differences" part of the article you referred 
to seems to confirm your suspicion.
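
For anyone who wants to see the two forms without creating files, here is a minimal sketch in base R. The \u escapes are an assumption of this sketch, chosen to write the two variants unambiguously; they do not appear in the messages above. "\u00e8" is the precomposed è and "e\u0300" is a plain e followed by the combining grave accent.

```r
# Precomposed form: è is the single code point U+00E8
composed <- "S\u00e8te"
# Decomposed form: plain "e" followed by the combining grave accent U+0300
decomposed <- "Se\u0300te"

# The byte sequences differ, matching the charToRaw() output quoted above
print(charToRaw(composed))    # 53 c3 a8 74 65
print(charToRaw(decomposed))  # 53 65 cc 80 74 65

# The strings look the same on screen but are not equal, so a pattern
# written in one form does not match text stored in the other
print(identical(composed, decomposed))            # FALSE
print(grepl(composed, decomposed, fixed = TRUE))  # FALSE
```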

I can get this to work but it is messy:

# "Sète" as returned by list.files(): e + combining accent (5 characters)
Sètefileraw = charToRaw(substr(b[2], 13, 17))
Sètefile = rawToChar(Sètefileraw)

# "Sète" as typed on the keyboard: single accented letter (4 characters)
Sètekbraw = charToRaw(substr(a[2], 13, 16))
Sètekb = rawToChar(Sètekbraw)

c = b
c = gsub(Sètefile, Sètekb, c)
# At this point, Sète has become the "keyboard" version and the rest of the
# script can work
c2 = gsub(" Sète", "S", c); c2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda"        "1_MO2 turbots po2crit.Rda"
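
The same replacement can probably be written without extracting substrings by position, assuming è is the only decomposed character in the names. This is only a sketch: b_like is a hypothetical stand-in for the list.files() result, built with \u escapes so that it does not depend on files on disk, and every other accented letter would need its own substitution.

```r
# Hypothetical stand-in for the list.files() result: the second name holds
# the decomposed è (plain e + combining grave accent U+0300)
b_like <- c("1_MO2 crevettes po2crit.Rda",
            "1_MO2 soles Se\u0300te sda.Rda",
            "1_MO2 turbots po2crit.Rda")

# Step 1: swap the decomposed è for the precomposed one (U+00E8)
fixed_names <- gsub("e\u0300", "\u00e8", b_like, fixed = TRUE)

# Step 2: a pattern written in the "keyboard" form now matches
short_names <- gsub(" S\u00e8te", "S", fixed_names, fixed = TRUE)
print(short_names)
```

With fixed = TRUE the pattern is treated as a literal string rather than a regular expression, which seems the safer choice for a plain substitution like this one.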

I'll keep accented vowels out of file names for this project whenever I have 
to use gsub on them!
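
A more general route would be proper Unicode normalization, as the Wikipedia article suggests. The sketch below assumes the stringi package from CRAN, which is not mentioned anywhere in this thread; its stri_trans_nfc() function converts strings to the composed form (NFC). The file name is a made-up stand-in for the list.files() output, and the whole example is skipped if stringi is not installed. On OS X, iconv(x, "UTF-8-MAC", "UTF-8") is often suggested for the same purpose, but whether that encoding name exists depends on the iconv build, so treat it as untested.

```r
# Assumption: the add-on package 'stringi' is installed; skip otherwise
if (requireNamespace("stringi", quietly = TRUE)) {
  # Hypothetical file name in decomposed form, as the file system returns it
  from_disk <- "1_MO2 soles Se\u0300te sda.Rda"

  # Normalize to NFC so that e + combining accent becomes the single è
  normalized <- stringi::stri_trans_nfc(from_disk)

  # A pattern in the precomposed ("keyboard") form now matches
  result <- gsub(" S\u00e8te", "S", normalized, fixed = TRUE)
  print(result)
}
```

Normalizing once, right after list.files(), would mean the rest of the script never has to care which form the file system used.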

Thanks again,

Denis
_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
