On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin
<[email protected]> wrote:
On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"
<[email protected]> said:
On Fri, 14 Jan 2011 08:59:35 -0500, spir <[email protected]> wrote:
The point is not to play games like that with Unicode's flexibility. Rather,
composite characters are just normal things in most languages of the world.
Actually, on this point, English is a rare exception (discounting letters
imported from foreign languages, like the French 'à'); to the point of
being, I guess, the only Western language without any diacritics.
Is it common to have multiple modifiers on a single character?
Not to my knowledge. But I rarely deal with non-Latin text; there are
probably some scripts out there that take advantage of this.
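Vietnamese, for one, routinely stacks two marks on a single vowel. A small
sketch of what that looks like at the code-point level (purely illustrative,
assuming std.uni's normalize is available):

import std.range : walkLength;
import std.uni;   // normalize, NFC

void main()
{
    // Vietnamese "ệ": one user-perceived character, but in decomposed form
    // it is a base letter plus two combining marks (dot below + circumflex).
    string decomposed = "e\u0323\u0302";
    string composed   = "\u1EC7";          // precomposed ệ

    assert(decomposed.walkLength == 3);    // three code points
    assert(composed.walkLength == 1);      // one code point

    // Byte-wise they differ; only a normalized comparison sees them as equal.
    assert(decomposed != composed);
    assert(normalize!NFC(decomposed) == composed);
}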
The problem I see with using decomposed canonical form for strings is
that we would have to return a dchar[] for each 'element', which
severely complicates code that, for instance, only expects to handle
English.
Actually, returning a sliced char[] or wchar[] could also be valid. A
user-perceived character is basically a substring of one or more code
points. I'm not sure it complicates the semantics of the language that
much -- what's complicated about writing str.front == "a" instead of
str.front == 'a'? -- although it probably would complicate the generated
code and make it a little slower.
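For illustration, a grapheme-as-slice front might look something like this
(a rough sketch; graphemeFront is a made-up name, and std.uni's
graphemeStride is assumed to be available):

import std.uni : graphemeStride;

// Rough sketch: a front() that yields the first grapheme as a slice of the
// original string instead of a single dchar.
string graphemeFront(string str)
{
    return str[0 .. graphemeStride(str, 0)];
}

void main()
{
    assert(graphemeFront("exposé") == "e");
    // Works for multi-code-unit graphemes too, but note this is still a
    // plain byte comparison -- both literals must use the same form here.
    assert(graphemeFront("élan") == "é");
}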
Hm... this pushes the normalization outside the type, and into the
algorithms (such as find). I was hoping to avoid that. I think I can
come up with an algorithm that normalizes into canonical form as it
iterates. It just might return part of a grapheme if the grapheme cannot
be composed.
I do think that we could make a byGrapheme member to aid in this:

foreach (grapheme; s.byGrapheme)
    // grapheme is a substring that contains one composed grapheme
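A sketch of how such a byGrapheme range could be put together on top of
std.uni's graphemeStride (details hypothetical; this version yields each
cluster as stored, without composing it):

import std.uni : graphemeStride;

// Sketch of a byGrapheme adaptor that yields each grapheme cluster as a
// slice of the original string.
struct ByGrapheme
{
    string str;

    @property bool empty() { return str.length == 0; }
    @property string front() { return str[0 .. graphemeStride(str, 0)]; }
    void popFront() { str = str[graphemeStride(str, 0) .. $]; }
}

ByGrapheme byGrapheme(string s) { return ByGrapheme(s); }

void main()
{
    foreach (grapheme; "exposé".byGrapheme)
    {
        // each grapheme is a string slice covering one full cluster,
        // e.g. "é" arrives as one element even if stored decomposed
    }
}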
In the case of NSString in Cocoa, you can only access the 'characters'
in their UTF-16 form. But everything from comparison to search for
substring is done using graphemes. It's as if they implemented
specialized Unicode-aware algorithms for these functions; there's nothing
generic about how it handles graphemes.
I'm not sure yet about what would be the right approach for D.
I hope we can use generic versions, so the type itself handles the
conversions. That makes any algorithm using the string range correct.
I was hoping to lazily transform a string into its composed canonical
form, allowing the (hopefully rare) exception when a composed character
does not exist. My thinking was that this at least gives a useful
string representation for 90% of usages, leaving the remaining 10% of
usages to find a more complex representation (like your Text type).
If we only get like 20% or 30% there by making dchar the element type,
then we haven't made it useful enough.
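Roughly, the compose-as-you-iterate idea could look like this (a sketch only;
nextComposed is a made-up helper, std.uni's graphemeStride and normalize are
assumed, and a real implementation would be lazy and avoid allocating for
clusters that are already composed):

import std.uni;   // graphemeStride, normalize, NFC

// Peel off one grapheme cluster at a time and hand it back in composed
// (NFC) form.  Clusters with no precomposed equivalent simply come back
// as-is, still decomposed.
string nextComposed(ref string str)
{
    auto len = graphemeStride(str, 0);
    auto cluster = str[0 .. len];
    str = str[len .. $];
    return normalize!NFC(cluster);
}

void main()
{
    string s = "expose\u0301";       // 'e' + combining acute at the end
    string result;
    while (s.length)
        result ~= nextComposed(s);
    assert(result == "expos\u00E9"); // composed form after iteration
}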
Either way, we need a string type that can be compared canonically
for things like searches or opEquals.
I wonder if normalized string comparison shouldn't be built directly into
the char[], wchar[], and dchar[] types instead.
No, in my vision of how strings should be typed, char[] is an array, not a
string. It should be treated like an array of code-units, where two forms
that create the same grapheme are considered different.
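To make the distinction concrete: as plain arrays, the two encodings of the
same grapheme compare unequal, and a canonical comparison has to normalize
first (sketch, assuming std.uni's normalize):

import std.uni;   // normalize, NFC

void main()
{
    string a = "caf\u00E9";     // precomposed é
    string b = "cafe\u0301";    // 'e' + combining acute

    assert(a != b);                               // array view: different code units
    assert(normalize!NFC(a) == normalize!NFC(b)); // canonical view: the same text
}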
Also bring in the idea above -- that iterating over a string would yield
graphemes as char[] -- and this code would work perfectly irrespective of
whether you used combining characters:
foreach (grapheme; "exposé") {
    if (grapheme == "é")
        break;
}
I think a good standard to evaluate our handling of Unicode is to see
how easy it is to do things the right way. In the above, foreach would
slice the string grapheme by grapheme, and the == operator would perform
a normalized comparison. While it works correctly, it's probably not the
most efficient way to do things, however.
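Spelled out, the loop amounts to something like this (a sketch only;
containsGrapheme is a made-up helper built on std.uni, and an efficient
version would not normalize one cluster at a time):

import std.uni;   // graphemeStride, normalize, NFC

// Slice grapheme by grapheme and compare each cluster in normalized form,
// so "é" is found whether the text stores it precomposed or as
// 'e' + combining acute.
bool containsGrapheme(string text, string g)
{
    auto needle = normalize!NFC(g);
    while (text.length)
    {
        auto len = graphemeStride(text, 0);
        if (normalize!NFC(text[0 .. len]) == needle)
            return true;
        text = text[len .. $];
    }
    return false;
}

void main()
{
    assert(containsGrapheme("expos\u00E9", "e\u0301")); // precomposed text, decomposed needle
    assert(containsGrapheme("expose\u0301", "\u00E9")); // decomposed text, precomposed needle
}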
I think this is a good alternative, but I'd rather not impose this on
people like myself who deal mostly with English. I think this should be
possible to do with wrapper types or intermediate ranges which have
graphemes as elements (per my suggestion above).
Does this sound reasonable?
-Steve