On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin <[email protected]> wrote:

On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" <[email protected]> said:

On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <[email protected]> wrote:

Steven Schveighoffer wrote:
 ...
I think a good standard for evaluating our handling of Unicode is to see
how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the
most efficient way to do things, however.
 I think this is a good alternative, but I'd rather not impose this on
people like myself who deal mostly with English. I think this should be
possible to do with wrapper types or intermediate ranges which have
graphemes as elements (per my suggestion above).
 Does this sound reasonable?
 -Steve
If it's a matter of choosing which is the 'default' range, I'd think proper Unicode handling is more reasonable than catering to English/ASCII only.
Especially since this is already the case in phobos string algorithms.
English and (if I understand correctly) most other languages. Any language built from composable graphemes would work, and in fact languages that use some graphemes which cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII; it's simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work and the code is simple and efficient, with the possibility to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or up to full Unicode when necessary.
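[Editorial illustration: the "code point vs. grapheme" gap under discussion can be shown concretely. This is a Python sketch rather than D, and the clustering rule below is a deliberate simplification of the full UAX #29 algorithm (it only attaches combining marks to the preceding base character, ignoring Hangul jamo, ZWJ sequences, etc.).]

```python
import unicodedata

def graphemes(s):
    """Rough grapheme clustering: attach combining marks to the
    preceding base character. A simplification of UAX #29."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301milie"      # "émilie" with 'é' as 'e' + COMBINING ACUTE ACCENT
print(len(s))           # 7 code points
print(len(graphemes(s)))  # 6 user-perceived characters
```

A range of dchars sees seven elements here; a grapheme-aware range sees six, which is why the two levels give different answers for slicing and foreach.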

Why don't we build a compiler with an optimizer that generates correct code *almost* all of the time? If you are worried about it not producing correct code for a given function, you can just add "pragma(correct_code)" in front of that function to disable the risky optimizations. No harm done, right?

One thing I see very often, often on US web sites but also elsewhere, is that if you enter a name with an accented letter in a form (say Émilie), very often the accented letter gets changed to another semi-random character later in the process. Why? Because somewhere in the process lies an encoding mismatch that no one thought about and no one tested for. At the very least, the form should have rejected those unexpected characters and shown an error when it could.
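[Editorial illustration: the classic form of this bug is UTF-8 bytes being reinterpreted in a legacy 8-bit encoding somewhere in the stack. A minimal Python sketch, using Windows-1252 as the assumed mismatched encoding:]

```python
name = "Émilie"
utf8_bytes = name.encode("utf-8")      # b'\xc3\x89milie'
garbled = utf8_bytes.decode("cp1252")  # some layer wrongly assumes Windows-1252
print(garbled)                         # 'Ã‰milie' -- the "semi-random character"
```

Each layer in the stack handled bytes "correctly" by its own assumptions; the mismatch only shows up when a non-ASCII character finally passes through.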

Now, with proper Unicode handling up to the code point level, this kind of problem probably won't happen as often because the whole stack works with UTF encodings. But are you going to validate all of your inputs to make sure they have no combining code point?
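[Editorial illustration: checking for combining code points that survive normalization is straightforward in principle, but almost nobody does it. A hedged Python sketch; the helper name is invented for illustration:]

```python
import unicodedata

def has_unexpected_combining(s):
    """True if s still contains combining marks after NFC normalization,
    i.e. marks with no precomposed form -- the input most code never tests."""
    return any(unicodedata.combining(ch)
               for ch in unicodedata.normalize("NFC", s))

print(has_unexpected_combining("e\u0301"))  # False: NFC folds it into 'é'
print(has_unexpected_combining("x\u0301"))  # True: there is no precomposed 'x́'
```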

Don't assume that because you're in the United States no one will try to enter characters where you don't expect them. People love to play with Unicode symbols for fun, putting them in their name, signature, or even domain names (✪df.ws). Just wait until they discover they can combine them. ☺̰̎! There is also a variety of combining mathematical symbols with no pre-combined form, such as ≸. Writing in Arabic, Hebrew, Korean, or some other foreign language isn't a prerequisite to use combining characters.


Essentially, we would have three levels of types:

1. char[], wchar[], dchar[] -- considered to be arrays in every way.

2. string_t!T (string, wstring, dstring) -- specialized string types that do normalization to dchars, but do not handle all graphemes perfectly. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[], or dchar[] array.

3. utfstring_t!T* -- specialized string to deal with full Unicode, which may perform worse than string_t, but supports everything Unicode supports. May require a battery of specialized algorithms.

* - name up for discussion
Also note that Phobos currently does *no* normalization, as far as I can tell, for things like opEquals. Two char[] arrays that represent equivalent strings, but not encoded in the same way, will compare as !=.
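[Editorial illustration: the normalized comparison that string_t would perform, shown as a Python sketch since D's std.uni is the actual target; `str_eq` is an invented helper name:]

```python
import unicodedata

def str_eq(a, b):
    """Normalized comparison: canonically equivalent strings compare
    equal even when built from different code point sequences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

s1 = "caf\u00e9"      # 'é' precomposed (U+00E9)
s2 = "cafe\u0301"     # 'e' + COMBINING ACUTE ACCENT
print(s1 == s2)       # False: raw code-unit comparison, what Phobos does today
print(str_eq(s1, s2))  # True: canonical equivalence
```

This is exactly the pair of equivalent-but-unequal char[] contents described above: code-unit comparison says !=, normalized comparison says ==.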

Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle things *really* right, you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me.

You make very good points. I concede that using dchar as the element type is not correct for Unicode strings.

-Steve
