I'm aware of UTF-8 only in Julia 0.5 and LegacyEncodings.jl (and some of the proposed changes in 0.6, though I think those are still only for basic UTF-8 support, not full Unicode, e.g. collation).

[Which language would have gold-standard Unicode (UTF-8) support, if not Perl: Rust (or Go)? Julia? Python? Other?]

I'm hoping there will never be a huge boilerplate header needed for good Unicode support, as in Perl. (I was under the mistaken impression that Perl had good Unicode support; it might still be the gold standard for Unicode, and for regex and string handling in general.)

At worst, if needed, then:

using ICU # any other needed?

Maybe: https://github.com/nolta/UnicodeExtras.jl

See the list at the bottom (or the full answer on Stack Overflow), at least for education on the can of worms that is full Unicode (UTF-8) support.

http://iaindunning.com/blog/julia-unicode.html
"The Julia <http://julialang.org> programming language has excellent support for Unicode."

For sure? If not, what is needed the most?

https://github.com/JuliaLang/julia/issues/774
"Titlecase info is provided by UTF8proc, but it would be nice to have a little wrapper routine like utf8proc_uppercase to make it easier to access."

Titlecase (see below), for example, was interesting to me: that there is a third case, and that numbers can be upper and lower case(?), or does he mean sub- and superscript? I knew some of what I quote below, but note that the full list includes more of the non-obscure issues.
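On the titlecase question, a quick check with Python's standard unicodedata module suggests the claim is about real case mappings and not about sub- or superscripts. Python is used here, not Julia, only because these calls are easy to verify; it says nothing about what Julia 0.5 exposes. The sample characters and the helper name describe are picked for illustration and do not come from the quoted answer.

```python
import unicodedata

def describe(ch):
    """Return (general category, lowercase, titlecase, uppercase) for one character."""
    return (unicodedata.category(ch), ch.lower(), ch.title(), ch.upper())

# U+01C6 is a "dz with caron" digraph: its lower, title and upper forms all differ.
digraph = describe("\u01c6")

# U+2167 ROMAN NUMERAL EIGHT is a number (category Nl) that has a lowercase form.
roman = describe("\u2167")

# U+24B6 CIRCLED LATIN CAPITAL LETTER A is a symbol (category So) with a lowercase form.
circled = describe("\u24b6")

# U+00B2 SUPERSCRIPT TWO is a number (category No) with no case mapping at all.
superscript = describe("\u00b2")

for label, info in [("digraph", digraph), ("roman", roman),
                    ("circled", circled), ("superscript", superscript)]:
    category, lower, title, upper = info
    print(label, category, ascii(lower), ascii(title), ascii(upper))
```

So at least in Python's copy of the Unicode tables, a number and a symbol can change case, while a superscript digit does not.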
There are some other optional Unicode packages, at least that I'm aware of:
https://github.com/randy3k/UnicodeCompletion

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

🐪🐫🐪🐫🐪 *Go Thou and Do Likewise* 🐪🐫🐪

------------------------------

Edit: Simplest ℞: 7 Discrete Recommendations

[Skipped list, which is for Perl; *what would be similar for Julia 0.5?*]

No Magic Bullet

Saying that “Perl should [*somehow!*] enable Unicode by default” doesn’t even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it’s also how those characters all interact in many, many ways.

Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new *Brave New World* modernity. [..]

Ideal for Unicode⸗Aware 🐪 Laundry List

At a minimum, here are some things that would appear to be required for 🐪 to “enable Unicode by default”, as you put it:

[24-item list; again Perl-specific. Some/all(?) apply to Julia, at least translated]

11. String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons.

Assume Brokenness

And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their 🐪 code will be broken.

[Applies to Julia and all other languages]

4. Code that assumes Perl uses UTF-8 internally is wrong.
6. Code that assumes Perl code points are limited to 0x10_FFFF is wrong.
9. Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are *not* \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
10. Code that assumes changing the case doesn’t change the length of the string is broken.
11. Code that assumes there are only two cases is broken. There’s also titlecase.
12. Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
14. Code that assumes Unicode gives a fig about POSIX locales is broken.
15. Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
26. Code that assumes that it cannot use "\x{FFFF}" is wrong.
28. Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.
29. Code that assumes the CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.
30. Code that assumes characters like > always point to the right and < always point to the left is wrong, because they in fact do not.
31. Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.
32. *Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong.* Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot (the rest will be duct-taped).
38. Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
39. Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
40. Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
41. Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.
42. Code that tries to reduce Unicode to ASCII is not merely wrong; its perpetrator should never be allowed to work in programming again. Period. I’m not even positive they should even be allowed to see again, since it obviously hasn’t done them much good so far.
43. Code that believes there’s some way to pretend textfile encodings don’t exist is broken and dangerous. Might as well poke the other eye out, too.
44. Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says *NOT TO DO THAT!* RTFM for why not.
45. Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix. [I believe heuristics for whether text is (or isn't) UTF-8 are, however, pretty good, and sometimes useful.]
47. Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
48. Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
50. Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You’d be surprised.
52. People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided *gratis* with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.

🐪 Boiler⸗plate for Unicode⸗Aware Code 🐪

[In Perl, the minimum boilerplate header for Unicode seems to be 13 lines of use statements, though the full one looks more than twice as long(?)]

SUMMARY

I don’t know how much more “default Unicode in 🐪” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more. [..]

Nothing but brain, and I mean *real brain*, will suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21ˢᵗ century, and you cannot wish Unicode away by willful ignorance. [..]

You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.

As just one example, canonical ordering is going to cause some real headaches. "\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for.
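The four strings in the canonical-ordering example can be poked at with Python's standard unicodedata module. This is only a naive sketch, and it is not what Unicode::Collate does: it decomposes each string to NFD and looks for a base "o" whose attached combining marks include U+0303 (combining tilde). The helper names marks_by_base and contains_o_tilde are hypothetical.

```python
import unicodedata

def marks_by_base(text):
    """Yield (base character, list of combining marks) pairs from the NFD form of text."""
    base, marks = None, []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and base is not None:
            # Non-zero combining class: attach the mark to the current base.
            marks.append(ch)
        else:
            if base is not None:
                yield base, marks
            base, marks = ch, []
    if base is not None:
        yield base, marks

def contains_o_tilde(text):
    """True if some base 'o' carries a combining tilde, in any mark order."""
    return any(base == "o" and "\u0303" in marks
               for base, marks in marks_by_base(text))

# The four spellings from the quoted example.
samples = ["\u00f5", "o\u0303", "o\u0303\u0304", "o\u0304\u0303"]

nfc_forms = [unicodedata.normalize("NFC", s) for s in samples]
nfd_forms = [unicodedata.normalize("NFD", s) for s in samples]

for s, nfc in zip(samples, nfc_forms):
    print(ascii(s), "NFC:", ascii(nfc), "matches:", contains_o_tilde(s))
```

Normalization alone does not make the four strings equal: NFC still leaves three distinct results, and a plain substring search for "o\u0303" in the NFD forms misses the last string, because the macron sits between the base letter and the tilde. Grouping marks per base letter gets around that in this one case, which hints at why the general problem is harder than it looks.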
If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: *ᴛʜᴇʀᴇ ɪs ɴᴏ Uɴɪᴄᴏᴅᴇ ᴍᴀɢɪᴄ ʙᴜʟʟᴇᴛ*

You cannot just change some defaults and get smooth sailing. It’s true that I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, *very* carefully.

------------------------------

¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁

I'm not sure how much of this is outdated, at least for Perl, but note that he also has at least one comment:

@xenoterracide No I didn’t use intentionally problematic code points; it’s a plot to get you to install George Douros’s super-awesome Symbola font <http://users.teilar.gr/%7Eg1951d/>, which covers Unicode 6.0. @depesz There isn’t room here to explain why each broken assumption is wrong. @leonbloy *Lots and lots* of this applies to Unicode in general, not just Perl. Some of this material may show up in 🐪 Programming Perl 🐪, 4th edition <http://oreilly.com/catalog/9780596004927/>, due out in October. I’ve one month left to work on it, and *Unicode is ᴍᴇɢᴀ* there; regexes, too – tchrist <http://stackoverflow.com/users/471272/tchrist> May 30 '11 at 17:38