On Mon, 17 Jan 2011 10:14:19 -0500, spir <[email protected]> wrote:

On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
Moreover, even if you ignore Hebrew as a tiny insignificant minority
you cannot do the same for Arabic which has over one *billion* people
that use that language.

I hope that the medium type works 'good enough' for those languages,
with the high level type needed for advanced usages.  At a minimum,
comparison and substring should work for all languages.

Hello Steven,

How does an application know that a given text, supposedly written in a given natural language (as indicated, for instance, by an HTML header), does not also hold terms from other languages? There are various occasions for this: quotations, use of foreign words, pointers...

A side issue is raised by precomposed codes for composite characters. For most languages of the world, I guess (but am unsure), all "official" characters have single-code representations. Good, but unfortunately this is not enforced by the standard (instead, the decomposed form can sensibly be considered the base form, but this is another topic). So even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single codes, to be safe one would have to normalise to NFC anyway (Normalisation Form Composed). Then, where is the actual gain? In fact, it is a loss, because NFC is more costly than NFD (Decomposed): the standard NFC algo first decomposes to NFD to get a unique representation that can then be more easily (re)composed via simple mappings.
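To make the composed/decomposed point concrete, here is a minimal sketch in Python (not D), using the standard `unicodedata` module. The sample strings are my own illustrations and are not taken from any text discussed in this thread.

```python
import unicodedata

# Two encodings of the same user-perceived character "é".
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # base letter + COMBINING ACUTE ACCENT

# Without normalisation, the two sequences do not compare equal.
raw_equal = composed == decomposed

# After normalising both sides to the same form, they do.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

# Not every combination has a precomposed code: "q" + acute accent
# stays two code points even in NFC.
no_single_code = unicodedata.normalize("NFC", "q\u0301")

print(raw_equal, nfc_equal, nfd_equal, len(no_single_code))
```

The last case is the reason normalising to NFC does not let an app assume one code per character.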

For further information:
Unicode's normalisation algos: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard itself, but unfortunately often refer to it.)

I'll reply to this to save you the trouble. I have reversed my position since writing a lot of these posts.

In summary, I think strings should default to an element type of a grapheme, which should be implemented via a slice of the original data. Updated string type forthcoming.
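As a rough illustration of "grapheme as a slice of the original data", here is a sketch in Python rather than D. It is not the forthcoming string type. The function name `grapheme_slices` is hypothetical, and the segmentation rule is a deliberate simplification (a base code point followed by its combining marks); full grapheme cluster rules are more involved. Because Python string slicing copies, the slices are represented as index pairs into the original text.

```python
import unicodedata

def grapheme_slices(text):
    """Yield (start, end) index pairs into text, one per simplified grapheme.

    Simplifying assumption: a grapheme is one code point with combining
    class 0 plus all following code points with a non-zero combining class.
    """
    start = 0
    for i in range(1, len(text)):
        # A non-combining code point starts a new grapheme.
        if unicodedata.combining(text[i]) == 0:
            yield (start, i)
            start = i
    if text:
        yield (start, len(text))

text = "e\u0301a"  # "é" in decomposed form, then "a": three code points
slices = list(grapheme_slices(text))
graphemes = [text[s:e] for s, e in slices]

print(len(text), len(slices), slices)
```

Iterating by these slices gives two elements for the three code points, which is what a user would count as characters.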

-Steve
