Re: Notice/Warning on narrowStrings .length

Brad Anderson Thu, 26 Apr 2012 23:14:13 -0700

On Friday, 27 April 2012 at 00:25:44 UTC, H. S. Teoh wrote:

On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:
"H. S. Teoh" <[email protected]> wrote in message news:[email protected]...
[...]
> And don't forget that some code points (notably from the CJK block)
> are specified as "double-width", so if you're trying to do text
> layout, you'll want yet a different length (layoutLength?).
>
Correction: the official term for this is "full-width" (as opposed to
the "half-width" of the typical European scripts).

Interesting. Kinda makes sense that such a thing exists, though: The CJK
characters (even the relatively simple Japanese *kanas) are detailed
enough that they need to be larger to achieve the same readability.
And that's the *non*-double-length ones. So I don't doubt there's ones
that need to be tagged as "Draw Extra Big!!" :)

Have you seen U+9F98? It's an insanely convoluted glyph composed of
*three copies* of an already extremely complex glyph.

        http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

(And yes, that huge thing is supposed to fit inside a SINGLE
character... what *were* those ancient Chinese scribes thinking?!)

For example, I have my font size in Windows Notepad set to a
comfortable value. But when I want to use hiragana or katakana, I have
to go into the settings and increase the font size so I can actually
read it (Well, to what *little* extent I can even read it in the first
place ;) ). And those kana tend to be among the simplest CJK
characters.

(Don't worry - I only use Notepad as a quick-n-dirty scrap space,
never for real coding/writing).

LOL... love the fact that you felt obligated to justify your use of
Notepad. :-P
> So we really need all four lengths. Ain't unicode fun?! :-)
>
No kidding. The *one* thing I really, really hate about Unicode is the
fact that most (if not all) of its complexity actually *is* necessary.

We're lucky the more imaginative scribes of the world have either been
dead for centuries or have restricted themselves to writing fictional
languages. :-) The inventions of the dead ones have been codified and
simplified by the unfortunate people who inherited their overly complex
systems (*cough* CJK glyphs *cough*), and the inventions of the living
ones are largely ignored by the world due to the fact that, well, their
scripts are only useful for writing fictional languages. :-)

So despite the fact that there is still some crazy convoluted stuff out
there, such as Arabic or Indic scripts with pair-wise substitution rules
in Unicode, overall things are relatively tame. At least the
subcomponents of CJK glyphs are no longer productive (actively being
used to compose new characters by script users) -- can you imagine the
insanity if Unicode had to support composition by those radicals and
subparts? Or if Unicode had to support a script like this one:

        http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

whose components are graphically composed in, shall we say, entirely
non-trivial ways (see the composed samples at the bottom of the page)?

Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

In an ideal world, where memory is not an issue and bus width is
indefinitely wide, a Unicode string would simply be a sequence of
integers (of arbitrary size). Things like combining diacritics, etc.,
would have dedicated bits/digits for representing them, so there's no
need of the complexity of UTF-8, UTF-16, etc. Everything fits into a
single character. Every possible combination of diacritics on every
possible character has a unique representation as a single integer.
String length would be equal to glyph count.

In such an ideal world, screens would also be of indefinitely detailed
resolution, so anything can fit inside a single grid cell, so there's no
need of half-width/double-width distinctions. You could port ancient
ASCII-centric C code just by increasing sizeof(char), and things would
Just Work.

Yeah I know. Totally impossible. But one can dream, right? :-)


[...]
> I've been thinking about unicode processing recently. Traditionally,
> we have to decode narrow strings into UTF-32 (aka dchar) then do
> table lookups and such. But unicode encoding and properties, etc.,
> are static information (at least within a single unicode release).
> So why bother with hardcoding tables and stuff at all?
>
> What we *really* should be doing, esp. for commonly-used functions
> like computing various lengths, is to automatically process said
> tables and encode the computation in finite-state machines that can
> then be optimized at the FSM level (there are known algos for
> generating optimal FSMs), codegen'd, and then optimized again at the
> assembly level by the compiler. These FSMs will operate at the
> native narrow string char type level, so that there will be no need
> for explicit decoding.
>
> The generation algo can then be run just once per unicode release,
> and everything will Just Work.
>
>

While I find that very interesting... I'm afraid I don't actually
understand your suggestion :/ (I do understand FSMs and how they
work, though) Could you give a little example of what you mean?
[...]
Currently, std.uni code (argh the pun!!) is hand-written with tables of
which character belongs to which class, etc. These hand-coded tables
are error-prone and unnecessary. For example, think of computing the
layout width of a UTF-8 stream. Why waste time decoding into dchar, and
then doing all sorts of table lookups to compute the width? Instead,
treat the stream as a byte stream, with certain sequences of bytes
evaluating to length 2, others to length 1, and yet others to length 0.

A lexer engine is perfectly suited for recognizing these kinds of
sequences with optimal speed. The only difference from a real lexer is
that instead of spitting out tokens, it keeps a running total (layout)
length, which is output at the end.
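
As a rough illustration of the byte-stream idea, here is a minimal
hand-written sketch in Python (not D, and not std.uni code). The name
layout_width_utf8 is hypothetical, and the width rules are a deliberately
tiny assumption: the combining diacritical marks U+0300..U+036F count as
0, the CJK unified ideographs U+4E00..U+9FFF count as 2, and everything
else counts as 1. A generated machine would cover the full property
tables and would also reject overlong or otherwise malformed sequences,
which this sketch only partly does.

```python
def layout_width_utf8(data: bytes) -> int:
    """Sum layout widths over raw UTF-8 bytes without building code points.

    Assumed (simplified) rules: U+0300..U+036F -> 0, U+4E00..U+9FFF -> 2,
    everything else -> 1.
    """
    total = 0
    pending = 0  # continuation bytes still expected in the current sequence
    width = 0    # width to add once the current sequence is complete
    lead = 0     # lead byte, kept only until the second byte has been seen

    for b in data:
        if pending == 0:
            if b < 0x80:                      # ASCII: one byte, width 1
                total += 1
            elif 0xC2 <= b <= 0xDF:           # start of a 2-byte sequence
                pending, width, lead = 1, 1, b
            elif 0xE0 <= b <= 0xEF:           # start of a 3-byte sequence
                # Lead bytes E5..E9 lie entirely inside U+4E00..U+9FFF.
                pending, width, lead = 2, (2 if 0xE5 <= b <= 0xE9 else 1), b
            elif 0xF0 <= b <= 0xF4:           # start of a 4-byte sequence
                pending, width, lead = 3, 1, b
            else:
                raise ValueError("invalid UTF-8 lead byte")
        else:
            if not 0x80 <= b <= 0xBF:
                raise ValueError("invalid UTF-8 continuation byte")
            # The second byte settles the class for the split lead bytes.
            if lead == 0xCC or (lead == 0xCD and b <= 0xAF):
                width = 0                     # U+0300..U+036F (CC 80 .. CD AF)
            elif lead == 0xE4 and b >= 0xB8:
                width = 2                     # U+4E00..U+4FFF (E4 B8 80 ..)
            lead = 0
            pending -= 1
            if pending == 0:
                total += width

    if pending:
        raise ValueError("truncated UTF-8 sequence")
    return total


# 'e' + combining acute accent + U+9F98: widths 1 + 0 + 2
print(layout_width_utf8("e\u0301\u9f98".encode("utf-8")))  # 3
```

The only state carried between bytes is a small tuple, so the loop is
effectively a tiny state machine driven by the narrow (byte) type, which
is the property being argued for here.
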
So what we should do is to write a tool that processes UnicodeData.txt
(the official table of character properties from the Unicode standard)
and generates lexer engines that compute various Unicode properties
(grapheme count, layout length, etc.) for each of the UTF encodings.
This way, we get optimal speed for these algorithms, plus we don't need
to manually maintain tables and stuff, we just run the tool on
UnicodeData.txt each time there's a new Unicode release, and the correct
code will be generated automatically.
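
To make the "generate it from the table" step a bit more concrete, here
is a small Python sketch under loud assumptions. SAMPLE_TABLE is a
hypothetical, pre-digested two-line table (range ; width), not the real
Unicode data file format, and parse_table, build_width_trie and
layout_width are made-up names. It builds a byte-level trie over the
UTF-8 encodings of the listed ranges and then walks raw bytes through
it. A real tool would read the actual Unicode data files, minimize the
automaton, and emit code for each UTF encoding rather than interpret a
dictionary at run time.

```python
SAMPLE_TABLE = """\
# first..last ; layout width   (hypothetical, hand-picked sample)
0300..036F ; 0   # combining diacritical marks
4E00..9FFF ; 2   # CJK unified ideographs
"""


def parse_table(text):
    """Parse 'FIRST..LAST ; WIDTH' lines into (first, last, width) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        span, width = (field.strip() for field in line.split(";"))
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), int(width)))
    return ranges


def build_width_trie(ranges):
    """Build a byte-level trie: inner nodes are dicts, leaves are widths.

    Assumes the ranges contain no surrogate code points.
    """
    trie = {}
    for first, last, width in ranges:
        for cp in range(first, last + 1):
            encoded = chr(cp).encode("utf-8")
            node = trie
            for b in encoded[:-1]:
                node = node.setdefault(b, {})
            node[encoded[-1]] = width
    return trie


def layout_width(data: bytes, trie, default=1):
    """Walk raw UTF-8 bytes through the trie, summing widths."""
    total, i, n = 0, 0, len(data)
    while i < n:
        node, j = trie, i
        while isinstance(node, dict) and j < n and data[j] in node:
            node = node[data[j]]
            j += 1
        if isinstance(node, int):      # reached a leaf: a listed sequence
            total += node
            i = j
        else:                          # unlisted: default width, skip it
            total += default
            i += 1
            while i < n and 0x80 <= data[i] <= 0xBF:
                i += 1
    return total


trie = build_width_trie(parse_table(SAMPLE_TABLE))
print(layout_width("e\u0301\u9f98".encode("utf-8"), trie))  # 3
```

Regenerating for a new Unicode release then amounts to rerunning the
build step on the updated table, with no hand-edited width tables in
the consuming code.
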


T

I'm not sure if you or others knew or not (I didn't until just now, as
there hasn't been an announcement) but one of the accepted GSoC
projects is extending Unicode support, by Dmitry Olshansky. Maybe take
up this idea with him.


https://www.google-melange.com/gsoc/project/google/gsoc2012/dolsh/31002

Regards,
Brad Anderson
