As far as I know, Julia 0.5 is UTF-8 only, plus LegacyEncodings.jl. I'm also
aware of some of the proposed changes in 0.6, but I think those are still only
for basic UTF-8 support, not full Unicode (e.g. collation).

[Which language would have gold-standard Unicode (UTF-8) support, if
not Perl: Rust (or Go)? Julia? Python? Other?]

I'm hoping a huge boilerplate header will never be needed for good
Unicode support, as it is in Perl. (I was under the mistaken impression that
Perl had good Unicode support by default; it still might be the gold
standard for Unicode, and for regex and string handling in general.) At
worst, if a header is needed, then:

using ICU # any other needed? Maybe:


See the list at the bottom (or the full answer on Stack Overflow), at least
for education, on the can of worms that is full Unicode (UTF-8) support.


"The Julia <http://julialang.org> programming language has excellent 
support for Unicode."

For sure? If not, what is needed the most?


"Titlecase info is provided by UTF8proc, but it would be nice to have a 
little wrapper routine like utf8proc_uppercase to make it easier to access."

E.g. titlecase (see below) was interesting to me: that there is a third
case, and that numbers can be upper and lower case(?), or does he mean
sub- and superscript? I knew some of what I quote below, but note that the
full list includes more of the non-obscure issues.
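
To make the "third case" concrete, here is a minimal sketch in Python (my choice for illustration only, using just its standard library; this is not how Julia or utf8proc exposes it). The digraph U+01C6 is one of the few characters whose uppercase, titlecase, and lowercase forms are all different:

```python
import unicodedata

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has three distinct case forms.
lower = "\u01c6"
upper = lower.upper()   # U+01C4, the all-capitals form
title = lower.title()   # U+01C5, the titlecase form (capital D, small z)

for form in (upper, title, lower):
    print(f"U+{ord(form):04X}", unicodedata.name(form), unicodedata.category(form))
```

The titlecase form has its own general category, Lt, next to Lu and Ll.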

There are some other optional Unicode packages, at least what I'm aware of:



🌴 🐪🐫🐪🐫🐪 🌞 *𝕲𝖔  𝕿𝖍𝖔𝖚  𝖆𝖓𝖉  𝕯𝖔  𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊* 🌞 🐪🐫🐪 🐁
𝓔𝓭𝓲𝓽 :  𝙎𝙞𝙢𝙥𝙡𝙚𝙨𝙩 *℞*:  𝟕 𝘿𝙞𝙨𝙘𝙧𝙚𝙩𝙚  𝙍𝙚𝙘𝙤𝙢𝙢𝙚𝙣𝙙𝙖𝙩𝙞𝙤𝙣𝙨

[Skipped list, that is for Perl; *What would be similar for Julia 0.5?*]

🎅    𝕹 𝖔   𝕸 𝖆 𝖌 𝖎 𝖈   𝕭 𝖚 𝖑 𝖑 𝖊 𝖙   🎅

Saying that “Perl should [*somehow!*] enable Unicode by default” doesn’t
even start to begin to think about getting around to saying enough to be
even marginally useful in some sort of rare and isolated case. Unicode is
much much more than just a larger character repertoire; it’s also how those
characters all interact in many, many ways.

Even the simple-minded minimal measures that (some) people seem to think
they want are guaranteed to miserably break millions of lines of code, code
that has no chance to “upgrade” to your spiffy new *Brave New World*

💡   𝕴𝖉𝖊𝖆𝖘   𝖋𝖔𝖗  𝖆   𝖀𝖓𝖎𝖈𝖔𝖉𝖊 ⸗ 𝕬𝖜𝖆𝖗𝖊   🐪   𝕷𝖆𝖚𝖓𝖉𝖗𝖞 𝕷𝖎𝖘𝖙   💡

At a minimum, here are some things that would appear to be required for 🐪
to “enable Unicode by default”, as you put it:

[24-item list; again Perl-specific. Some/all(?) apply to Julia, at least 

11. String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c&cc are always
wrong. So instead of @a = sort @b, you need @a = 
Unicode::Collate->new->sort(@b). Might as well add that to your export 
PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons.
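
The same trap exists outside Perl. A small Python sketch (standard library only; the word list is made up for illustration) shows what plain code point comparison does to a sort:

```python
words = ["zebra", "Zoo", "\u00e9clair", "apple"]

# The default comparison is by code point: uppercase ASCII sorts before
# lowercase, and accented letters sort after every unaccented one.
by_code_point = sorted(words)
print(by_code_point)
```

A reader would expect "éclair" between "apple" and "zebra". Getting that order needs a real collation algorithm (the Unicode Collation Algorithm, e.g. through ICU), which Python's standard library does not provide either.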

💩     𝔸 𝕤 𝕤 𝕦 𝕞 𝕖   𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤     💩

And that’s not all. There are a million broken assumptions that people make
about Unicode. Until they understand these things, their 🐪 code will be

[Applies to Julia and all other languages]

4. Code that assumes Perl uses UTF‑8 internally is wrong.

6. Code that assumes Perl code points are limited to 0x10_FFFF is wrong.

9. Code that assumes every lowercase code point has a distinct uppercase
one, or vice versa, is broken. For example, "ª" is a lowercase letter with
no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not
lowercase letters; however, they are both lowercase code points without
corresponding uppercase versions. Got that? They are *not*
\p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.

10. Code that assumes changing the case doesn’t change the length of the
string is broken.
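
A quick way to see this, sketched in Python (illustration only; the sample words are my own), is to uppercase characters whose full case mapping expands to more than one code point:

```python
samples = ["stra\u00dfe", "\ufb01sh"]   # sharp s, and the "fi" ligature

lengths = []
for s in samples:
    up = s.upper()
    lengths.append((len(s), len(up)))
    print(repr(s), len(s), "->", repr(up), len(up))
```

Both strings grow by one code point when uppercased, so any buffer or column width computed before the case change is off afterwards.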

11. Code that assumes there are only two cases is broken. There’s also

12. Code that assumes only letters have case is broken. Beyond just 
letters, it turns out that numbers, symbols, and even marks have case. In 
fact, changing the case can even make something change its main general 
category, like a \p{Mark} turning into a \p{Letter}. It can also make it 
switch from one script to another.
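
This seems to answer my question above about numbers having case. A hedged Python sketch (standard library only; the three sample characters are my own picks, not from the original answer) with a number, a symbol, and a mark that all change under case mapping:

```python
import unicodedata

samples = [
    "\u2167",  # ROMAN NUMERAL EIGHT (a number, category Nl)
    "\u24b6",  # CIRCLED LATIN CAPITAL LETTER A (a symbol, category So)
    "\u0345",  # COMBINING GREEK YPOGEGRAMMENI (a mark, category Mn)
]

pairs = []
for ch in samples:
    # Use whichever case mapping actually changes the character.
    swapped = ch.lower() if ch.lower() != ch else ch.upper()
    pairs.append((unicodedata.category(ch), unicodedata.category(swapped)))
    print(unicodedata.name(ch), "->", unicodedata.name(swapped))

print(pairs)
```

The last pair is the category change the item describes: the combining mark uppercases to GREEK CAPITAL LETTER IOTA, which is a letter.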

14. Code that assumes Unicode gives a fig about POSIX locales is broken.

15. Code that assumes you can remove diacritics to get at base ASCII 
letters is evil, still, broken, brain-damaged, wrong, and justification for 
capital punishment.
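
Apart from being rude to the text, the usual trick does not even do what people think it does. A minimal Python sketch of the common "decompose and drop the marks" approach (the function name `strip_marks` is hypothetical, mine, and not something to use in earnest):

```python
import unicodedata

def strip_marks(text):
    """Naive 'unaccent': decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_marks("\u0141\u00f3d\u017a"))   # the Polish city name
print(strip_marks("S\u00f8ren"))            # a Danish given name
```

The stroked letters have no decomposition, so they survive untouched and the result is still not ASCII; the letters that do get "simplified" now spell a different word.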

26. Code that assumes that it cannot use "\x{FFFF}" is wrong.

28. Code that transcodes from UTF‐16 or UTF‐32 with leading BOMs into UTF‐8
is broken if it puts a BOM at the start of the resulting UTF-8. This is so 
stupid the engineer should have their eyelids removed.

29. Code that assumes the CESU-8 is a valid UTF encoding is wrong. 
Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken 
and wrong. These guys also deserve the eyelid treatment.
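
For items 28 and 29, a small Python sketch of what correct behaviour looks like (standard library only; Python's built-in codecs are just one example of a strict implementation, and the byte strings are hand-built for illustration):

```python
import codecs

# Item 28: transcoding UTF-16 (with BOM) to UTF-8 should not emit a BOM.
utf16 = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
text = utf16.decode("utf-16")    # the BOM selects the byte order and is consumed
utf8 = text.encode("utf-8")      # the plain "utf-8" codec writes no BOM
print(utf8)

# Item 29: a strict UTF-8 decoder rejects both of these byte sequences.
samples = [
    ("overlong NUL", b"\xc0\x80"),                    # U+0000 in "modified UTF-8"
    ("CESU-8 camel", b"\xed\xa0\xbd\xed\xb0\xaa"),    # surrogates D83D DC2A, 3 bytes each
]
rejected = []
for label, data in samples:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        rejected.append(label)
        print(label, "rejected:", err.reason)
```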

30. Code that assumes characters like > always points to the right and <
always points to the left are wrong, because they in fact do not.

31. Code that assumes if you first output character X and then character Y,
that those will show up as XY is wrong. Sometimes they don’t.

32. *Code that assumes that ASCII is good enough for writing English 
properly is stupid, shortsighted, illiterate, broken, evil, and wrong.* Off 
with their heads! If that seems too extreme, we can compromise: henceforth 
they may type only with their big toe from one foot (the rest still be 

38. Code that believes \p{InLatin} is the same as \p{Latin} is heinously 

39. Code that believes that \p{InLatin} is almost ever useful is almost
certainly wrong.

40. Code that believes that given $FIRST_LETTER as the first letter in some 
alphabet and $LAST_LETTER as the last letter in that same alphabet, that 
[${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost 
always completely broken and wrong and meaningless.
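
A concrete case of item 40, sketched in Python (my own example, not from the original answer): the Russian alphabet runs from а to я, but a code point range between those two letters misses ё, which sits elsewhere in the Cyrillic block.

```python
import re

# Russian alphabet: first letter U+0430 (а), last letter U+044F (я).
russian_lower = re.compile("[\u0430-\u044f]")

in_range = russian_lower.fullmatch("\u0436")   # ж, inside the range
missing = russian_lower.fullmatch("\u0451")    # ё, a Russian letter outside it

print(bool(in_range), bool(missing))
```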

41. Code that believes someone’s name can only contain certain characters
is stupid, offensive, and wrong.

42. Code that tries to reduce Unicode to ASCII is not merely wrong, its 
perpetrator should never be allowed to work in programming again. Period. 
I’m not even positive they should even be allowed to see again, since it
obviously hasn’t done them much good so far.

43. Code that believes there’s some way to pretend textfile encodings don’t
exist is broken and dangerous. Might as well poke the other eye out, too.

44. Code that converts unknown characters to ? is broken, stupid, 
braindead, and runs contrary to the standard recommendation, which says *NOT 
TO DO THAT!* RTFM for why not.

45. Code that believes it can reliably guess the encoding of an unmarked 
textfile is guilty of a fatal mélange of hubris and naïveté that only a
lightning bolt from Zeus will fix.

[I believe heuristics for whether text is (or isn't) UTF-8 are, however,
pretty good, and sometimes useful.]
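
The heuristic I have in mind is simply "does it decode as strict UTF-8?". A minimal Python sketch (the function name `looks_like_utf8` is mine, a hypothetical helper, not a library call):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = "na\u00efvet\u00e9"
print(looks_like_utf8(sample.encode("utf-8")))
print(looks_like_utf8(sample.encode("latin-1")))
```

It only answers "could this be UTF-8?". Pure ASCII passes trivially, and a negative answer does not say which legacy encoding the text is in, so this does not contradict item 45.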

47. Code that believes once you successfully create a file by a given name, 
that when you run ls or readdir on its enclosing directory, you’ll actually
find that file with the name you created it under is buggy, broken, and 
wrong. Stop being surprised by this!

48. Code that believes UTF-16 is a fixed-width encoding is stupid, broken, 
and wrong. Revoke their programming licence.
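
Easy to check in any language that can encode to UTF-16; here is a Python sketch counting 16-bit code units per character (sample characters are my own):

```python
units = {}
for ch in ["a", "\u00f5", "\U0001F42A"]:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no BOM.
    units[ch] = len(ch.encode("utf-16-le")) // 2
    print(f"U+{ord(ch):04X}: {units[ch]} UTF-16 code unit(s)")
```

Anything outside the Basic Multilingual Plane, such as the camel U+1F42A, takes a surrogate pair, i.e. two code units.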

50. Code that believes that stuff like /s/i can only match "S" or "s" is 
broken and wrong. You’d be surprised.
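
The surprise comes from case folding, which is what Unicode-aware case-insensitive matching is usually built on. A Python sketch using `str.casefold` (this shows the folding itself, not any particular regex engine's behaviour):

```python
import unicodedata

candidates = ["S", "s", "\u017f", "K", "k", "\u212a"]

folded = {}
for ch in candidates:
    folded[ch] = ch.casefold()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} folds to {folded[ch]!r}")
```

LATIN SMALL LETTER LONG S folds to "s" and KELVIN SIGN folds to "k", so a matcher that compares folded text treats them as matches for /s/i and /k/i.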

52. People who want to go back to the ASCII world should be whole-heartedly 
encouraged to do so, and in honor of their glorious upgrade they should be 
provided *gratis* with a pre-electric manual typewriter for all their 
data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs
telegraph at 40 characters per line and hand-delivered by a courier. STOP.

  🎁 🐪   𝕭𝖔𝖎𝖑𝖊𝖗⸗𝖕𝖑𝖆𝖙𝖊  𝖋𝖔𝖗  𝖀𝖓𝖎𝖈𝖔𝖉𝖊⸗𝕬𝖜𝖆𝖗𝖊  𝕮𝖔𝖉𝖊   🐪 🎁

[In Perl, the minimum boilerplate header for Unicode seems to be 13 lines of
use statements, and the full version is more than twice as long(?)]

😱     𝕾 𝖀 𝕸 𝕸 𝕬 𝕽 𝖄     😱

I don’t know how much more “default Unicode in 🐪” you can get than what
I’ve written. Well, yes I do: you should be using Unicode::Collate and
Unicode::LineBreak, too. And probably more.


Nothing but brain, and I mean *real brain*, will suffice here. There’s a
heck of a lot of stuff you have to learn. Modulo the retreat to the manual
typewriter, you simply cannot hope to sneak by in ignorance. This is the
21ˢᵗ century, and you cannot wish Unicode away by willful ignorance.


You may be able to get a few reasonable defaults for a very few and very 
limited operations, but not without thinking about things a whole lot more 
than I think you have.

As just one example, canonical ordering is going to cause some real
headaches. 😭 "\x{F5}" *‘õ’*, "o\x{303}" *‘õ’*, "o\x{303}\x{304}" *‘ȭ’*, and
"o\x{304}\x{303}" *‘ō̃’* should all match *‘õ’*, but how in the world are
you going to do that? This is harder than it looks, but it’s something you
need to account for. 💣
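
One possible way to attack this particular example, sketched in Python under my own assumptions (the helper name `has_o_tilde` is hypothetical, and it only handles a single non-empty grapheme, nothing like general matching): decompose to NFD, then ask whether the base letter is "o" and whether a tilde is among its combining marks, regardless of their order.

```python
import unicodedata

TILDE = "\u0303"  # COMBINING TILDE

def has_o_tilde(text):
    """True if text is one base 'o' whose combining marks include a tilde."""
    decomposed = unicodedata.normalize("NFD", text)
    base, marks = decomposed[0], decomposed[1:]
    only_marks = all(unicodedata.combining(m) for m in marks)
    return base == "o" and only_marks and TILDE in marks

forms = ["\xf5", "o\u0303", "o\u0303\u0304", "o\u0304\u0303"]
matches = [has_o_tilde(f) for f in forms]
print(matches)

# Normalization alone does not make the last two forms equal: both marks have
# the same combining class, so their relative order is preserved.
nfc = [unicodedata.normalize("NFC", f) for f in forms]
print(nfc[2] == nfc[3])
```

So normalizing both sides is necessary but not sufficient; the order of marks with the same combining class is significant, and a "match õ" search has to decide what to do about that.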

If there’s one thing I know about Perl, it is what its Unicode bits do and
do not do, and this thing I promise you:  *“ ̲ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲ ̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲ ̲ ”*  😞

You cannot just change some defaults and get smooth sailing. It’s true that
I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is
mostly for command-line stuff. For real work, I go through all the many
steps outlined above, and I do it very, *very* carefully.
😈 ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁ 😈

I'm not sure how much is outdated, at least for Perl, but note he also has 
at least one comment:

@xenoterracide No I didn’t use intentionally problematic code points; it’s
a plot to get you to install George Douros’s super-awesome Symbola font
<http://users.teilar.gr/%7Eg1951d/>, which covers Unicode 6.0. 😈 @depesz
There isn’t room here to explain why each broken assumption is wrong.
@leonbloy *Lots and lots* of this applies to Unicode in general, not just
Perl. Some of this material may show up in 🐪 Programming Perl 🐪, 4th
edition <http://oreilly.com/catalog/9780596004927/>, due out in October. 🎃
I’ve one month left to ✍ work on it, and *Unicode is ᴍᴇɢᴀ* there; regexes,
too – tchrist <http://stackoverflow.com/users/471272/tchrist> May 30 '11 at

