[julia-users] Julia Unicode (UTF-8) support (vs. Perl..); Also includes humorous, educational, list (part of, advised to read all if you program [in Perl]..)

Páll Haraldsson Wed, 12 Oct 2016 08:31:37 -0700

As far as I'm aware, Julia 0.5 has UTF-8 only, plus LegacyEncodings.jl (and
some changes are proposed for 0.6). Still, I think that amounts only to
basic UTF-8 support, not full Unicode, e.g. collation.

[What/which language would have gold-standard Unicode (UTF-8) support, if
not Perl; Rust (or Go)? Julia? Python? Other?]

I'm hoping a huge boilerplate header will never be needed for good Unicode
support, as it is in Perl. (I was under the mistaken impression that Perl
had good Unicode support; it might still be the gold standard for Unicode,
and for regex and string handling in general.) At worst, if needed, then:

using ICU # any other needed? Maybe:

https://github.com/nolta/UnicodeExtras.jl

See list at the bottom (or full answer at stackoverflow), at least for
education, on the can-of-worms that is full Unicode (UTF-8) support.

http://iaindunning.com/blog/julia-unicode.html

"The Julia <http://julialang.org> programming language has excellent
support for Unicode."

For sure? If not, what is needed the most?

https://github.com/JuliaLang/julia/issues/774

"Titlecase info is provided by UTF8proc, but it would be nice to have a
little wrapper routine like utf8proc_uppercase to make it easier to access."

E.g. titlecase (see below) was interesting to me: I did not know there was
a third case. Also that numbers can be upper and lower case(?), or does he
mean sub-/superscript? I knew some of what I quote below, but note that the
full list includes more of the non-obscure issues.
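
A minimal sketch of the case questions above, in Python 3 rather than Julia
(my choice, only because its standard library exposes the Unicode character
database directly). The specific code points are examples I picked, not ones
taken from the quoted answer. It suggests that "numbers have case" is meant
literally (e.g. Roman numerals), not as sub-/superscript:

```python
import unicodedata

# Titlecase is a third case: the digraph U+01C6 (dz with caron) has a
# distinct uppercase form (U+01C4) and a distinct titlecase form (U+01C5).
dz = "\u01c6"
dz_upper = dz.upper()
dz_title = dz.title()

# Case is not limited to letters. SMALL ROMAN NUMERAL EIGHT is a number
# (category Nl) and CIRCLED LATIN SMALL LETTER A is a symbol (category So),
# yet both have uppercase mappings.
roman_small_eight = "\u2177"
circled_small_a = "\u24d0"

# Changing case can change the length: U+00DF (sharp s) uppercases to "SS".
sharp_s = "\u00df"

for ch in (dz, dz_upper, dz_title, roman_small_eight, circled_small_a):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
print(len(sharp_s), "->", len(sharp_s.upper()))
```

The Julia equivalents would have to be checked against the Julia version in
use; this only shows what the Unicode data itself says.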

There are some other optional Unicode packages, at least what I'm aware of:

https://github.com/randy3k/UnicodeCompletion

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

🌴 🐪🐫🐪🐫🐪 🌞 *𝕲𝖔 𝕿𝖍𝖔𝖚 𝖆𝖓𝖉 𝕯𝖔 𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊*
🌞 🐪🐫🐪 🐁
------------------------------
𝓔𝓭𝓲𝓽 : 𝙎𝙞𝙢𝙥𝙡𝙚𝙨𝙩 *℞*: 𝟕
𝘿𝙞𝙨𝙘𝙧𝙚𝙩𝙚 𝙍𝙚𝙘𝙤𝙢𝙢𝙚𝙣𝙙𝙖𝙩𝙞𝙤𝙣𝙨

[Skipped list, that is for Perl; *What would be similar for Julia 0.5?*]

🎅 𝕹 𝖔 𝕸 𝖆 𝖌 𝖎 𝖈 𝕭 𝖚 𝖑 𝖑 𝖊 𝖙 🎅

Saying that “Perl should [*somehow!*] enable Unicode by default” doesn’t
even start to begin to think about getting around to saying enough to be
even marginally useful in some sort of rare and isolated case. Unicode is
much much more than just a larger character repertoire; it’s also how those
characters all interact in many, many ways.

Even the simple-minded minimal measures that (some) people seem to think
they want are guaranteed to miserably break millions of lines of code, code
that has no chance to “upgrade” to your spiffy new *Brave New World*
modernity.[..]

💡 𝕴𝖉𝖊𝖆𝖘 𝖋𝖔𝖗 𝖆 𝖀𝖓𝖎𝖈𝖔𝖉𝖊 ⸗ 𝕬𝖜𝖆𝖗𝖊 🐪
𝕷𝖆𝖚𝖓𝖉𝖗𝖞 𝕷𝖎𝖘𝖙 💡

At a minimum, here are some things that would appear to be required for 🐪
to “enable Unicode by default”, as you put it:

[24-item list; again Perl-specific. Some/all(?) apply to Julia, at least
translated]

11. String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c&cc are always
wrong. So instead of @a = sort @b, you need @a =
Unicode::Collate->new->sort(@b). Might as well add that to your export
PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons.
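
To illustrate the point about comparisons (a sketch in Python 3, with a word
list I made up): sorting by code point, or even by case-folded code point,
does not give dictionary order. `str.casefold` below is only a stand-in key
to show the key-caching pattern; it is not a collation key, which would have
to come from a Unicode Collation Algorithm implementation such as ICU.

```python
# "\u00e9clair" is "éclair" with a precomposed e-acute.
words = ["zebra", "Zulu", "apple", "\u00e9clair"]

# Plain sorting compares code points: "Z" (U+005A) < "a" (U+0061) < "é" (U+00E9).
by_code_point = sorted(words)

# Case folding fixes the "Z" problem but still files "é" after "z".
by_casefold = sorted(words, key=str.casefold)


def sort_with_cached_keys(items, key_func):
    """Compute each (possibly expensive) key once, then compare only keys."""
    keyed = [(key_func(item), item) for item in items]
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed]


print(by_code_point)
print(by_casefold)
print(sort_with_cached_keys(words, str.casefold))
```

Python's `sorted(..., key=...)` already computes each key once, so the
explicit helper is there only to make the "cache the key" idea visible.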

💩 𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤 💩

And that’s not all. There are a million broken assumptions that people make
about Unicode. Until they understand these things, their 🐪 code will be
broken.

[Applies to Julia and all other languages]

4. Code that assumes Perl uses UTF‑8 internally is wrong.

6. Code that assumes Perl code points are limited to 0x10_FFFF is wrong.

9. Code that assumes every lowercase code point has a distinct uppercase
one, or vice versa, is broken. For example, "ª" is a lowercase letter with
no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not
lowercase letters; however, they are both lowercase code points without
corresponding uppercase versions. Got that? They are *not*
\p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.

10. Code that assumes changing the case doesn’t change the length of the
string is broken.

11. Code that assumes there are only two cases is broken. There’s also
titlecase.

12. Code that assumes only letters have case is broken. Beyond just
letters, it turns out that numbers, symbols, and even marks have case. In
fact, changing the case can even make something change its main general
category, like a \p{Mark} turning into a \p{Letter}. It can also make it
switch from one script to another.

14. Code that assumes Unicode gives a fig about POSIX locales is broken.

15. Code that assumes you can remove diacritics to get at base ASCII
letters is evil, still, broken, brain-damaged, wrong, and justification for
capital punishment.

26. Code that assumes that it cannot use "\x{FFFF}" is wrong.

28. Code that transcodes from UTF‐16 or UTF‐32 with leading BOMs into UTF‐8
is broken if it puts a BOM at the start of the resulting UTF-8. This is so
stupid the engineer should have their eyelids removed.

29. Code that assumes the CESU-8 is a valid UTF encoding is wrong.
Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken
and wrong. These guys also deserve the eyelid treatment.

30. Code that assumes characters like > always point to the right and <
always point to the left is wrong, because they in fact do not.

31. Code that assumes if you first output character X and then character Y,
that those will show up as XY is wrong. Sometimes they don’t.

32. *Code that assumes that ASCII is good enough for writing English
properly is stupid, shortsighted, illiterate, broken, evil, and wrong.* Off
with their heads! If that seems too extreme, we can compromise: henceforth
they may type only with their big toe from one foot (the rest will be
duct-taped).

38. Code that believes \p{InLatin} is the same as \p{Latin} is heinously
broken.

39. Code that believes that \p{InLatin} is almost ever useful is almost
certainly wrong.

40. Code that believes that given $FIRST_LETTER as the first letter in some
alphabet and $LAST_LETTER as the last letter in that same alphabet, that
[${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost
always completely broken and wrong and meaningless.

41. Code that believes someone’s name can only contain certain characters
is stupid, offensive, and wrong.

42. Code that tries to reduce Unicode to ASCII is not merely wrong, its
perpetrator should never be allowed to work in programming again. Period.
I’m not even positive they should even be allowed to see again, since it
obviously hasn’t done them much good so far.

43. Code that believes there’s some way to pretend textfile encodings don’t
exist is broken and dangerous. Might as well poke the other eye out, too.

44. Code that converts unknown characters to ? is broken, stupid,
braindead, and runs contrary to the standard recommendation, which says *NOT
TO DO THAT!* RTFM for why not.

45. Code that believes it can reliably guess the encoding of an unmarked
textfile is guilty of a fatal mélange of hubris and naïveté that only a
lightning bolt from Zeus will fix.

[I believe, however, that heuristics for whether text is (or isn't) UTF-8
are pretty good, and sometimes useful.]

47. Code that believes once you successfully create a file by a given name,
that when you run ls or readdir on its enclosing directory, you’ll actually
find that file with the name you created it under is buggy, broken, and
wrong. Stop being surprised by this!

48. Code that believes UTF-16 is a fixed-width encoding is stupid, broken,
and wrong. Revoke their programming licence.

50. Code that believes that stuff like /s/i can only match "S" or "s" is
broken and wrong. You’d be surprised.

52. People who want to go back to the ASCII world should be whole-heartedly
encouraged to do so, and in honor of their glorious upgrade they should be
provided *gratis* with a pre-electric manual typewriter for all their
data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs
telegraph at 40 characters per line and hand-delivered by a courier. STOP.
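
On items 29 and 45 and my bracketed remark about heuristics: a sketch, in
Python 3, of the simplest "is this UTF-8?" check, which is a strict decode.
The helper name `looks_like_utf8` and the sample byte strings are my own. It
can rule UTF-8 out, but it cannot prove UTF-8 in, since pure ASCII bytes pass
and are equally valid in many other encodings.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


text = "na\u00efvet\u00e9"                   # "naïveté"
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

# Item 29: neither of these is valid UTF-8.
overlong_nul = b"\xc0\x80"                   # overlong encoding of U+0000
cesu8_emoji = b"\xed\xa0\xbd\xed\xb8\x80"    # U+1F600 as two encoded surrogates

for label, data in [("utf-8", as_utf8), ("latin-1", as_latin1),
                    ("overlong NUL", overlong_nul), ("CESU-8", cesu8_emoji)]:
    print(label, looks_like_utf8(data))
```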

🎁 🐪 𝕭𝖔𝖎𝖑𝖊𝖗⸗𝖕𝖑𝖆𝖙𝖊 𝖋𝖔𝖗 𝖀𝖓𝖎𝖈𝖔𝖉𝖊⸗𝕬𝖜𝖆𝖗𝖊
𝕮𝖔𝖉𝖊 🐪 🎁

[In Perl, the minimum boilerplate header for Unicode seems to be 13 lines of
use statements; the full header is more than twice as long(?)]

😱 𝕾 𝖀 𝕸 𝕸 𝕬 𝕽 𝖄 😱

I don’t know how much more “default Unicode in 🐪” you can get than what
I’ve written. Well, yes I do: you should be using Unicode::Collate and
Unicode::LineBreak, too. And probably more.

[..]

Nothing but brain, and I mean *real brain*, will suffice here. There’s a
heck of a lot of stuff you have to learn. Modulo the retreat to the manual
typewriter, you simply cannot hope to sneak by in ignorance. This is the
21ˢᵗ century, and you cannot wish Unicode away by willful ignorance.

[..]

You may be able to get a few reasonable defaults for a very few and very
limited operations, but not without thinking about things a whole lot more
than I think you have.

As just one example, canonical ordering is going to cause some real
headaches. 😭"\x{F5}" *‘õ’*, "o\x{303}" *‘õ’*, "o\x{303}\x{304}" *‘ȭ’*, and
"o\x{304}\x{303}" *‘ō̃’* should all match *‘õ’*, but how in the world are
you going to do that? This is harder than it looks, but it’s something you
need to account for. 💣
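
A sketch of one way to approach the example above, in Python 3 (not
necessarily how Perl or Julia would do it): decompose to NFD and look for the
tilde among the combining marks instead of comparing strings directly. The
function name is hypothetical, and it handles a single base character plus
its marks, nothing more.

```python
import unicodedata

COMBINING_TILDE = "\u0303"


def has_o_with_tilde(grapheme: str) -> bool:
    """True if the text is a base "o" whose combining marks include a tilde."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    base, marks = decomposed[:1], decomposed[1:]
    only_marks = all(unicodedata.combining(m) != 0 for m in marks)
    return base == "o" and only_marks and COMBINING_TILDE in marks


# The four strings from the paragraph above.
candidates = ["\u00f5", "o\u0303", "o\u0303\u0304", "o\u0304\u0303"]

for s in candidates:
    print([f"U+{ord(c):04X}" for c in s], has_o_with_tilde(s))
```

Plain equality fails here ("\u00f5" and "o\u0303" are different strings until
normalized), and normalizing alone is not enough either, because the last two
candidates stay distinct from the first two under both NFC and NFD.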

If there’s one thing I know about Perl, it is what its Unicode bits do and
do not do, and this thing I promise you: *“ ̲ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲
̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲ ̲ ”* 😞

You cannot just change some defaults and get smooth sailing. It’s true that
I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is
mostly for command-line stuff. For real work, I go through all the many
steps outlined above, and I do it very, *very* carefully.
------------------------------
😈 ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁ 😈

I'm not sure how much is outdated, at least for Perl, but note he also has
at least one comment:

@xenoterracide No I didn’t use intentionally problematic code points; it’s
a plot to get you to install George Douros’s super-awesome Symbola font
<http://users.teilar.gr/%7Eg1951d/>, which covers Unicode 6.0. 😈 @depesz
There isn’t room here to explain why each broken assumption is wrong.
@leonbloy *Lots and lots* of this applies to Unicode in general, not just
Perl. Some of this material may show up in 🐪 Programming Perl 🐪, 4th
edition <http://oreilly.com/catalog/9780596004927/>, due out in October. 🎃
I’ve one month left to ✍ work on it, and *Unicode is ᴍᴇɢᴀ* there; regexes,
too – tchrist <http://stackoverflow.com/users/471272/tchrist> May 30 '11 at
17:38
