On Sat, 15 Jan 2011 13:21:12 -0500, foobar <[email protected]> wrote:
Steven Schveighoffer Wrote:
English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals).
What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII; it's simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work and the code is simple and efficient, with the possibility of jumping down to lower levels for performance (e.g. switching to char[] when you know ASCII is all you are using) or up to full Unicode when necessary.
Essentially, we would have three levels of types (a rough sketch of the middle level follows the list):
char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that normalize to dchars, but do not handle all graphemes perfectly. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type of string literals. Represented internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- Specialized string that deals with full Unicode, which may perform worse than string_t but supports everything Unicode supports. May require a battery of specialized algorithms.
* - name up for discussion
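Roughly, something along these lines (just an illustrative sketch, not real Phobos code; the name SimpleString and all of the details are made up, and normalization is left out entirely). It only shows the by-code-point iteration that makes bidirectional-range algorithms work over a char[] backing store:

import std.utf : decode, stride, strideBack;

// Hypothetical sketch: a wrapper over a plain string that iterates by
// decoded code point (dchar) instead of by UTF-8 code unit.  A real
// string_t would presumably also normalize in opEquals and friends.
struct SimpleString
{
    string data;                     // backing char[] storage

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);      // decode the first code point
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $];   // drop one UTF-8 sequence
    }

    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i);      // decode the last code point
    }

    void popBack()
    {
        data = data[0 .. $ - strideBack(data, data.length)];
    }

    @property SimpleString save() { return this; }
}

unittest
{
    import std.range.primitives : isBidirectionalRange;
    static assert(isBidirectionalRange!SimpleString);

    auto s = SimpleString("héllo");
    assert(s.front == 'h' && s.back == 'o');
}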
Also note that Phobos currently does *no* normalization, as far as I can tell, for things like opEquals. Two char[]'s that represent equivalent strings, but encode them with different code point sequences, will compare as !=.
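For example (a tiny self-contained snippet; the literals are just an illustration):

void main()
{
    string precomposed = "\u00E9";     // "é" as one code point, U+00E9
    string decomposed  = "e\u0301";    // "e" plus combining acute, U+0301

    // Array comparison works on code units, so two canonically
    // equivalent strings compare as not equal.
    assert(precomposed != decomposed);
}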
-Steve
The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I would prefer that the standard lib provide no normalization at all and force me to use a 3rd-party lib, rather than provide an incomplete implementation that gives me a false sense of correctness and causes very subtle, hard-to-find bugs.
I feel like you might be exaggerating, but maybe I'm completely wrong on this; I'm not well-versed in Unicode, or even in languages that require it. The clear benefit I see is that with a string type that normalizes to canonical code points, you can use it with any algorithm without that algorithm having to be Unicode-aware, for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.
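As a sketch of the idea (this assumes the normalize routine that newer versions of std.uni provide, which was not part of Phobos at the time of this thread):

import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "r\u00E9sum\u00E9";   // "résumé" with precomposed é
    string decomposed  = "re\u0301sume\u0301"; // same text, decomposed form

    // Once both sides are normalized to the same canonical form,
    // plain comparison and generic range algorithms just work.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
    assert(normalize!NFC(decomposed).canFind("sum"));
}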
It's like calendars. There are quite a few different calendars in different cultures, but most people use the Gregorian calendar. So we have three options:
a) Use the Gregorian calendar, and leave the other calendars to a 3rd-party library.
b) Use a complicated calendar system where the Gregorian calendar is treated with equal respect to all other calendars, with none being the default.
c) Use the Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.
I'm looking at my proposal as more of a c) solution.
Can you show how normalization causes subtle bugs?
Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which has over one *billion* people who use that language.
I hope that the medium type works 'well enough' for those languages, with the high-level type needed for advanced uses. At a minimum, comparison and substring extraction should work for all languages.
I firmly believe that, in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full Unicode type (utfstring_t above) as the default.
You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA-sequence type and not Unicode text.
Or French, or Spanish, or German, etc...
Look, even the lowest level is valid Unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes or code units.
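To illustrate the difference between those levels (assuming the byGrapheme range that newer versions of std.uni provide):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" written with a combining diaeresis: 'n' 'o' 'e' U+0308 'l'
    string s = "noe\u0308l";

    assert(s.length == 6);                // 6 UTF-8 code units
    assert(s.walkLength == 5);            // 5 code points (decoded dchars)
    assert(s.byGrapheme.walkLength == 4); // 4 user-perceived characters
}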
-Steve