iconv(xU, "UTF-8", "", sub = "byte")
[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Z¨¹rcher"
Sys.setlocale("LC_CTYPE", "Arabic")
[1] "Arabic_Saudi Arabia.1256"
iconv(x1, "latin1", "") # NA NA NA
[1] NA NA NA
iconv(xU, "UTF-8", "") # NA NA NA
[1] NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstr\370m" "J\366reskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "", sub="byte")
[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Zürcher"
iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstr??m" "J??reskog" "bi??chen Zürcher"
Etc. As the above is typically garbled between e-mail
transfer agents, I attach both the iconv-Windows.R R script and
the corresponding iconv-Windows.Rout R transcript to this
e-mail (using MIME type text/plain, which is easy when using emacs
for mail); they contain a bit more than the above.
Note that the above shows that using 'sub = *' or
"//TRANSLIT" after a previous NA result helps quite a bit,
in the sense that it is much more informative to see
"J?reskog" instead of NA.
I'm considering updating packageDescription() to try these in
case it first returns NA. This would make the citation() hack
unnecessary.
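
A minimal sketch of such a fallback, for illustration only: the helper
name iconv_fallback() and its argument defaults are hypothetical and not
part of R, and this is not necessarily how packageDescription() would
implement it. It tries the plain conversion first, then "//TRANSLIT",
and finally 'sub = *' for whatever is still NA.

```r
## Hypothetical helper (not part of R): retry NA results of iconv(),
## first with transliteration, then with byte/character substitution.
iconv_fallback <- function(x, from, to = "", sub = "?") {
    res <- iconv(x, from, to)
    ## only entries that *became* NA count as failures
    bad <- is.na(res) & !is.na(x)
    if (any(bad)) {
        ## 2nd attempt: "//TRANSLIT" appended to the target encoding.
        ## Assumption: where this form is unsupported iconv() signals an
        ## error, which is treated here as "still NA".
        tr <- tryCatch(iconv(x[bad], from, paste0(to, "//TRANSLIT")),
                       error = function(e) rep(NA_character_, sum(bad)))
        res[bad] <- tr
        bad <- is.na(res) & !is.na(x)
    }
    ## last resort: substitute the non-convertible bytes
    if (any(bad))
        res[bad] <- iconv(x[bad], from, to, sub = sub)
    res
}
```

With x1 and xU as defined in the attached script, iconv_fallback(xU, "UTF-8")
would be the call to try in the Chinese and Arabic locales; what it returns
depends on the locale and on the iconv implementation in use.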
Martin
iconv-Windows.R
#### iconv() behavior depending on Locales LC_CTYPE in Windows
#### ======= ==============================
###
### In a *shell* in Windows (emacs), after doing R.home() in R, use that to do
### something like
### c:/PROGRA~1/R/R-devel/bin/R CMD BATCH iconv-Windows.R
### ^^^^^^^^^^^^^^^^^^^^^^^^^^= === ===== =============== ==> producing
### iconv-Windows.Rout
###
sessionInfo() ## does not matter so much
## -- should be Windows to exhibit the problems
## From help(iconv) 's example : Using "latin1" European language letters:
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
## 2 locales that do not work well : ---------------------------------
Sys.setlocale("LC_CTYPE", "Chinese")
iconv(x1, "latin1", "") # NA NA NA
iconv(x1, "latin1", "//TRANSLIT") # perfect for Chinese
iconv(x1, "latin1", "", sub = "byte")
iconv(xU, "UTF-8", "") # NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub = "byte")
##--
Sys.setlocale("LC_CTYPE", "Arabic")
iconv(x1, "latin1", "") # NA NA NA
iconv(x1, "latin1", "//TRANSLIT") # not bad, but not perfect
iconv(x1, "latin1", "", sub="byte")
iconv(x1, "latin1", "", sub="?")
iconv(xU, "UTF-8", "") # NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub="byte")
iconv(xU, "UTF-8", "", sub="?")
## 2 locales that work well for these examples (no wonder) -----------
Sys.setlocale("LC_CTYPE", "German_Switzerland")
iconv(x1, "latin1", "")
iconv(x1, "latin1", "//TRANSLIT")
iconv(x1, "latin1", "", sub="?")
iconv(xU, "UTF-8", "")
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub="?")
##--
Sys.setlocale("LC_CTYPE", "English")
iconv(x1, "latin1", "")
iconv(x1, "latin1", "//TRANSLIT")
iconv(x1, "latin1", "", sub="?")
iconv(xU, "UTF-8", "")
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub="?")
iconv-Windows.Rout
R Under development (unstable) (2017-06-25 r72854) -- "Unsuffered Consequences"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
#### iconv() behavior depending on Locales LC_CTYPE in Windows
#### ======= ==============================
###
### In a *shell* in Windows (emacs), after doing R.home() in R, use that to do
### something like
### c:/PROGRA~1/R/R-devel/bin/R CMD BATCH iconv-Windows.R
### ^^^^^^^^^^^^^^^^^^^^^^^^^^= === ===== =============== ==> producing
### iconv-Windows.Rout
###
sessionInfo() ## does not matter so much
R Under development (unstable) (2017-06-25 r72854)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0
## -- should be Windows to exhibit the problems
## From help(iconv) 's example : Using "latin1" European language letters:
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
## 2 locales that do not work well : ---------------------------------
Sys.setlocale("LC_CTYPE", "Chinese")
[1] "Chinese (Simplified)_People's Republic of China.936"
iconv(x1, "latin1", "") # NA NA NA
[1] NA NA NA
iconv(x1, "latin1", "//TRANSLIT") # perfect for Chinese
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x1, "latin1", "", sub = "byte")
[1] "Ekstr<f8>m" "J<f6>reskog" "bi<df>chen Z¨¹rcher"
iconv(xU, "UTF-8", "") # NA NA NA
[1] NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "", sub = "byte")
[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Z¨¹rcher"
##--
Sys.setlocale("LC_CTYPE", "Arabic")
[1] "Arabic_Saudi Arabia.1256"
iconv(x1, "latin1", "") # NA NA NA
[1] NA NA NA
iconv(x1, "latin1", "//TRANSLIT") # not bad, but not perfect
[1] "Ekstr\370m" "J\366reskog" "bißchen Zürcher"
iconv(x1, "latin1", "", sub="byte")
[1] "Ekstr<f8>m" "J<f6>reskog" "bi<df>chen Zürcher"
iconv(x1, "latin1", "", sub="?")
[1] "Ekstr?m" "J?reskog" "bi?chen Zürcher"
iconv(xU, "UTF-8", "") # NA NA NA
[1] NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstr\370m" "J\366reskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "", sub="byte")
[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Zürcher"
iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstr??m" "J??reskog" "bi??chen Zürcher"
## 2 locales that work well for these examples (no wonder) -----------
Sys.setlocale("LC_CTYPE", "German_Switzerland")
[1] "German_Switzerland.1252"
iconv(x1, "latin1", "")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x1, "latin1", "//TRANSLIT")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x1, "latin1", "", sub="?")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
##--
Sys.setlocale("LC_CTYPE", "English")
[1] "English_United States.1252"
iconv(x1, "latin1", "")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x1, "latin1", "//TRANSLIT")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(x1, "latin1", "", sub="?")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
proc.time()
user system elapsed
0.18 0.14 0.98