On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" <[email protected]> said:

On Fri, 14 Jan 2011 08:59:35 -0500, spir <[email protected]> wrote:

The point is not playing like that with Unicode flexibility. Rather, it's that composite characters are just normal thingies in most languages of the world. Actually, on this point, English is a rare exception (discarding letters imported from foreign languages like French 'à'); to the point of being, I guess, the only western language without any diacritic.

Is it common to have multiple modifiers on a single character?

Not to my knowledge. But I rarely deal with non-Latin texts; there are probably some scripts out there that take advantage of this.


The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English.

Actually, returning a sliced char[] or wchar[] could also be valid. A user-perceived character is basically a substring of one or more code points. I'm not sure it complicates the semantics of the language that much -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower.

In the case of NSString in Cocoa, you can only access the 'characters' in their UTF-16 form. But everything from comparison to searching for a substring is done using graphemes. It's like they implemented specialized Unicode-aware algorithms for these functions. There's no genericity in how it handles graphemes.

I'm not sure yet about what would be the right approach for D.


I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough.
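To give an idea of how far composed canonical form (Unicode NFC) gets you, and where the exceptions show up, here is a small sketch -- in Python rather than D, just because its standard unicodedata module makes the point in a few lines:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"
assert len(decomposed) == 2

# NFC normalization composes them into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
assert len(composed) == 1
assert composed == "\u00e9"

# The exception: some combinations have no precomposed code point,
# so NFC must leave them as a base character plus combining marks.
# 'A' + U+030A COMBINING RING ABOVE composes (to U+00C5)...
assert len(unicodedata.normalize("NFC", "A\u030a")) == 1
# ...but 'x' + the same combining ring has no precomposed form.
assert len(unicodedata.normalize("NFC", "x\u030a")) == 2
```

That last case is the "hopefully rare" exception: any design built purely on composed form needs a fallback for clusters that stay multi-code-point after NFC.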

Either way, we need a string type that can be compared canonically for things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly into the char[], wchar[], and dchar[] types instead. Also, bring in the idea above that iterating over a string would yield graphemes as char[], and this code would work correctly irrespective of whether you used combining characters:

        foreach (grapheme; "exposé") {
                if (grapheme == "é")
                        break;
        }

I think a good standard for evaluating our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.
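For the record, the semantics that loop relies on can be sketched concretely -- again in Python for brevity. The graphemes() function below is a deliberately simplified segmenter (a base code point grouped with its trailing combining marks; a real implementation would follow the full UAX #29 grapheme cluster rules), and canon_eq() is the normalized comparison the == above would perform:

```python
import unicodedata

def graphemes(s):
    """Yield simplified grapheme clusters: a base code point plus any
    combining marks that follow it. (Full grapheme segmentation per
    Unicode UAX #29 handles more cases, e.g. Hangul jamo.)"""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

def canon_eq(a, b):
    # Canonical equivalence: equal after normalizing both sides.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Works whether the accented letter is precomposed or decomposed.
found = False
for g in graphemes("expose\u0301"):   # "exposé" with a decomposed é
    if canon_eq(g, "\u00e9"):         # precomposed "é"
        found = True
        break
assert found
```

The loop finds the "é" even though the haystack spells it as two code points and the needle as one, which is exactly the behavior the D snippet above asks of foreach and ==.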

--
Michel Fortin
[email protected]
http://michelf.com/
