On Sat, 15 Jan 2011 13:21:12 -0500, foobar <[email protected]> wrote:
Steven Schveighoffer Wrote:
English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals).
What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII; it's simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work and the code is simple and efficient, with the possibility of jumping down to lower levels for performance (e.g. switching to char[] when you know ASCII is all you are using) or up to full Unicode when necessary.
Essentially, we would have three levels of types (a rough sketch of the middle level follows the list):
char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that normalize to dchars, but do not handle all graphemes perfectly. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type of string literals. Represented internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- Specialized string that deals with full Unicode, which may perform worse than string_t but supports everything Unicode supports. May require a battery of specialized algorithms.
* - name up for discussion
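Roughly, something along these lines (just an illustrative sketch, not real Phobos code; the name SimpleString and all of the details are made up, and normalization is left out entirely). It only shows the by-code-point iteration that makes bidirectional-range algorithms work over a char[] backing store:

import std.utf : decode, stride, strideBack;

// Hypothetical sketch: a wrapper over a plain string that iterates by
// decoded code point (dchar) instead of by UTF-8 code unit.  A real
// string_t would presumably also normalize in opEquals and friends.
struct SimpleString
{
    string data;                     // backing char[] storage

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);      // decode the first code point
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $];   // drop one UTF-8 sequence
    }

    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i);      // decode the last code point
    }

    void popBack()
    {
        data = data[0 .. $ - strideBack(data, data.length)];
    }

    @property SimpleString save() { return this; }
}

unittest
{
    import std.range.primitives : isBidirectionalRange;
    static assert(isBidirectionalRange!SimpleString);

    auto s = SimpleString("héllo");
    assert(s.front == 'h' && s.back == 'o');
}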
Also note that Phobos currently does *no* normalization, as far as I can tell, for things like opEquals. Two char[]'s that represent equivalent strings, but encode them with different code point sequences, will compare as !=.
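For example (a tiny self-contained snippet; the literals are just an illustration):

void main()
{
    string precomposed = "\u00E9";     // "é" as one code point, U+00E9
    string decomposed  = "e\u0301";    // "e" plus combining acute, U+0301

    // Array comparison works on code units, so two canonically
    // equivalent strings compare as not equal.
    assert(precomposed != decomposed);
}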
-Steve
The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I would prefer that the standard lib provide no normalization at all and force me to use a 3rd-party lib, rather than provide an incomplete implementation that gives me a false sense of correctness and causes very subtle, hard-to-find bugs.
I feel like you might be exaggerating, but maybe I'm completely wrong on this; I'm not well-versed in Unicode, or even in languages that require it. The clear benefit I see is that with a string type that normalizes to canonical code points, you can use it with any algorithm without that algorithm having to be Unicode-aware, for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.
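As a sketch of the idea (this assumes the normalize routine that newer versions of std.uni provide, which was not part of Phobos at the time of this thread):

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "r\u00E9sum\u00E9";   // "résumé" with precomposed é
    string decomposed  = "re\u0301sume\u0301"; // same text, decomposed form

    // Once both sides are normalized to the same canonical form,
    // plain comparison and generic range algorithms just work.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
    assert(normalize!NFC(decomposed).canFind("sum"));
}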
It's like calendars. There are quite a few different calendars in different cultures, but most people use the Gregorian calendar. So we have three options:
a) Use the Gregorian calendar, and leave the other calendars to a 3rd-party library.
b) Use a complicated calendar system where the Gregorian calendar is treated with equal respect to all other calendars, with none being the default.
c) Use the Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.
I'm looking at my proposal as more of a c) solution.
Can you show how normalization causes subtle bugs?
Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which has over one *billion* people who use that language.
I hope that the medium type works 'well enough' for those languages, with the high-level type needed for advanced uses. At a minimum, comparison and substring extraction should work for all languages.
I firmly believe that, in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full Unicode type (utfstring_t above) as the default.
You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA-sequence type and not Unicode text.
Or French, or Spanish, or German, etc...
Look, even the lowest level is valid Unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes or code units.
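To illustrate the difference between those levels (assuming the byGrapheme range that newer versions of std.uni provide):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" written with a combining diaeresis: 'n' 'o' 'e' U+0308 'l'
    string s = "noe\u0308l";

    assert(s.length == 6);                // 6 UTF-8 code units
    assert(s.walkLength == 5);            // 5 code points (decoded dchars)
    assert(s.byGrapheme.walkLength == 4); // 4 user-perceived characters
}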
-Steve