On Sat, 15 Jan 2011 13:21:12 -0500, foobar <[email protected]> wrote:

Steven Schveighoffer Wrote:


English and (if I understand correctly) most other languages.  Any
language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work
to some degree (for example, opEquals).

What I'm proposing (or think I'm proposing) is not exactly catering to
English and ASCII, what I'm proposing is simply not catering to more
complex languages such as Hebrew and Arabic. What I'm trying to find is a
middle ground where most languages work, and the code is simple and
efficient, with possibilities to jump down to lower levels for performance
(i.e. switch to char[] when you know ASCII is all you are using) or jump
up to full unicode when necessary.

Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle all graphemes perfectly. Works
with any algorithm that deals with bidirectional ranges.  This is the
default string type, and the type for string literals.  Represented
internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- specialized string to deal with full unicode, which may
perform worse than string_t, but supports everything unicode supports.
May require a battery of specialized algorithms.

* - name up for discussion

Also note that phobos currently does *no* normalization as far as I can
tell for things like opEquals.  Two char[]'s that represent equivalent
strings, but not in the same way, will compare as !=.
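To make the problem concrete, here is a small illustration (in Python, using its standard `unicodedata` module, since the behavior is the same regardless of language): two canonically equivalent encodings of the same character compare unequal until they are normalized.

```python
import unicodedata

# "é" can be encoded in two canonically equivalent ways:
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Comparing the raw code-point sequences says the strings differ...
print(precomposed == decomposed)   # False

# ...but after canonical normalization (NFC here) they compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True
```

This is exactly the `opEquals` situation described above: without a normalization step, two char[]'s holding equivalent text will compare as !=.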

-Steve

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on this; I'm not well-versed in unicode, or even in languages that require it. The clear benefit I see is that with a string type that normalizes to canonical code points, you can use it in any algorithm without that algorithm having to be unicode-aware, for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.
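A sketch of that code-reuse point (in Python for illustration): a completely Unicode-unaware generic algorithm, naive code-point reversal, corrupts decomposed French text but works correctly once the text is normalized to NFC, because the accented letters collapse to single code points.

```python
import unicodedata

def naive_reverse(s: str) -> str:
    # A generic, Unicode-unaware algorithm: reverse by code point.
    return s[::-1]

word = "noe\u0308l"  # "noël" with a decomposed 'ë' (e + combining diaeresis)

# Reversing the decomposed form detaches the diaeresis from its base
# letter -- the accent ends up on the 'l'.
print(naive_reverse(word))

# After NFC normalization, 'ë' is a single code point, so the same
# naive algorithm produces the right answer: 'lëon'.
print(naive_reverse(unicodedata.normalize("NFC", word)))
```

This only helps for scripts whose graphemes have precomposed forms, which is the limitation foobar is pointing at below.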

It's like calendars. There are quite a few different calendars in different cultures. But most people use a Gregorian calendar. So we have three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library.
b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, and none is the default.
c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

Can you show how normalization causes subtle bugs?

Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which is used by over one *billion* people.

I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.

I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default.

You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.

Or French, or Spanish, or German, etc...

Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units.
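The "more machinery" for graphemes can be sketched like this (Python again, purely illustrative): a simplified grapheme counter that starts a new cluster at every non-combining code point. This is a rough approximation, not the full UAX #29 segmentation algorithm, which also handles ZWJ sequences, Hangul jamo, and more. The example character (d with dot below plus a combining macron below) has no single precomposed form, so even NFC normalization leaves it as two code points, and only grapheme-level machinery sees it as one character.

```python
import unicodedata

def count_graphemes(s: str) -> int:
    # Simplified grapheme counting: a new cluster starts at each
    # non-combining code point (combining class 0).  Real boundary
    # detection follows UAX #29 and covers many more cases.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# U+1E0D (d with dot below) + U+0331 (combining macron below):
# no precomposed form exists for the full combination.
s = "\u1e0d\u0331"
print(len(s))                                 # 2 code points
print(len(unicodedata.normalize("NFC", s)))   # still 2 after NFC
print(count_graphemes(s))                     # 1 grapheme
```

This is the gap between the proposed middle-level string_t (normalized code points) and the full utfstring_t (grapheme-aware) type.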

-Steve
