On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu <[email protected]> said:

On 1/17/11 10:34 AM, Michel Fortin wrote:
As I said: all those people who are not validating the inputs to make
sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text contains combining code points.

Not much, right now.

The problem is that the answer to this question is likely to change as Unicode support improves in operating systems and applications. Shouldn't we future-proof Phobos?
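To make the problem concrete, here is a minimal sketch (in Python for illustration; D's string types behave analogously at the code-point level) of how one visible character can be spelled with or without a combining code point:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'é' as 'e' plus combining acute accent (U+0301)

# Both render identically, yet naive code-point operations disagree:
print(len(precomposed))            # 1
print(len(decomposed))             # 2
print(precomposed == decomposed)   # False

# Only after normalization do the two spellings compare equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Code that validates or compares strings without normalizing (or without grapheme awareness) silently treats these two inputs as different.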


It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.


Even NSString exposes the underlying UTF-16 encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the cases where graphemes matter), you need font information. And - surprise! - some fonts do NOT support combining characters and print the marks next to one another instead of combining them, so the "wrong" method of counting characters is more informative.

Generally what OS X does in those cases is display that character in another font. That said, counting graphemes is never a good way to tell how much space some text will take (unless the application enforces a fixed width per grapheme). It's more useful for telling the number of characters in a text document, similar to a word count.


That said, no one should really have to care but those who implement the
string manipulation functions. The idea behind making the grapheme the
element type is to make it easier to write grapheme-aware string
manipulation functions, even if you don't know about graphemes. But the
reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.

I'm not sure what you mean by that.


- - -

I gave some thought to all this, and came to an interesting realization that made me refine the proposal. The new proposal is perhaps as disruptive as the first, but in a different way.

But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes,
and what is a 'character'
Fact 3: most people won't bother with all this; they'll just use the basic language facilities and assume everything works correctly if it works correctly for them

Nice :o).

Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.


To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".

It's not just D users, it's also the users of programs written by D users. I can't count how many times I've seen accented characters mishandled on websites and elsewhere, and I have an aversion to doing the same thing to people of other cultures and languages.

If the operating system supports combining marks, users expect that applications running on it will deal with them correctly too, and they'll (rightfully) blame your application if it doesn't. Same for websites.

I understand that in some situations you don't want to deal with graphemes even if you theoretically should, but I don't think it should be the default.


One observation I made with having dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? Because the only way to represent code point 'a' in UTF-8 is with the single code unit 'a'.

Right. That's why many algorithms in std are specialized for such cases.

And guess what? Almost the same optimization can apply to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check whether that code unit is followed by a combining mark code point to confirm it is really an 'a', not part of a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot.
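The optimization described above can be sketched as follows - in Python for illustration (a Phobos implementation would work on `immutable(char)[]` directly), with `unicodedata` standing in for the Unicode tables, and `find_grapheme_a` a hypothetical name:

```python
import unicodedata

def find_grapheme_a(data: bytes) -> int:
    """Byte-level search for the grapheme 'a' in UTF-8 data.
    In UTF-8 the byte 0x61 can only ever encode code point 'a'
    (multi-byte sequences never contain ASCII bytes), so the search
    itself needs no decoding; we decode only around a hit to confirm
    the 'a' is not followed by a combining mark."""
    start = 0
    while True:
        i = data.find(b"a", start)
        if i == -1:
            return -1
        # data[i] is a whole code point, so data[i+1:] starts on a
        # code point boundary. (A real implementation would decode
        # just one code point here, not the whole tail.)
        tail = data[i + 1:].decode("utf-8")
        if not tail or unicodedata.combining(tail[0]) == 0:
            return i       # a plain 'a', not part of a composed grapheme
        start = i + 1      # 'a' + combining mark: skip this false match

print(find_grapheme_a("la ville".encode("utf-8")))    # 1
print(find_grapheme_a("a\u0301 seul".encode("utf-8")))  # -1 (only a composed grapheme)
```

The same byte-level strategy works for any needle that is a single ASCII character, which is exactly the common case the text describes.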

Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 haystack. I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding graphemes to confirm it's really "étoilé"
2. search for a code point matching 'é' or 'e', then confirm that the code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a regular expression operating at code-unit level. With that you can search efficiently for the needle directly in code units without having to decode and/or normalize the whole haystack.
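As a rough sketch of that second approach - here in Python, matching on code points rather than raw UTF-8 code units, with a hypothetical `needle_pattern` helper; it assumes each composed character decomposes to a base character plus combining marks:

```python
import re
import unicodedata

def needle_pattern(needle: str) -> "re.Pattern":
    """Build a regex matching both the composed (NFC) and decomposed
    (NFD) spellings of the needle, so the haystack itself never needs
    to be normalized before searching."""
    parts = []
    for ch in unicodedata.normalize("NFC", needle):
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1:
            # Composed character: accept either spelling at this position.
            parts.append("(?:%s|%s)" % (re.escape(ch), re.escape(nfd)))
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

pat = needle_pattern("\u00e9toil\u00e9")  # "étoilé"
print(bool(pat.search("une \u00e9toil\u00e9 brillante")))                    # True (NFC haystack)
print(bool(pat.search(unicodedata.normalize("NFD", "l'\u00e9toil\u00e9"))))  # True (NFD haystack)
```

Note this sketch can still match a prefix of a larger grapheme (e.g. 'é' followed by yet another combining mark), so a real implementation would add the surrounding-grapheme confirmation step described above.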


This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited for strings because you always need to fully decode the string (into dchars or graphemes) just to call the predicate. Given that comparing characters for anything other than equality or membership in a set is very rarely done, generic algorithms miss a big optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable predicates? You can't implement the above optimization if you can't guarantee the predicate is "==". That said, perhaps we can detect "==" and apply the optimization only then.

Being able to specify the predicate doesn't gain you much for strings, because a < 'a' doesn't make much sense. All you need to check for is equality with some value or membership in a given character set, both of which can use the optimization above.


So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific
Unicode-aware algorithms should be used instead; they can share the same
name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not have made it very clear either).

The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on. By making it an implementation detail, you can later make grapheme-aware algorithms the default without changing the API. Since you're the gatekeeper of Phobos, you can make this change conditional on getting an acceptable level of performance out of the grapheme-aware algorithms, or on other factors like the amount of combining characters you encounter in the wild in the next few years.

So the general string functions would implement your compromise (using dchar) but not commit to it indefinitely. Someone who really wants to work with code points can use toDchar, someone who wants to deal with graphemes uses toGraphemes, and someone who doesn't care chooses nothing and gets the default compromise behaviour.
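To illustrate what "dressing" a string explicitly could look like - a rough Python sketch with hypothetical `to_code_points`/`to_graphemes` helpers standing in for toDchar/toGraphemes; real grapheme segmentation (Unicode UAX #29) has more rules than just trailing combining marks:

```python
import unicodedata

def to_code_points(s):
    """Iterate by code point (what Python's str does natively;
    analogous to toDchar)."""
    return iter(s)

def to_graphemes(s):
    """Very rough grapheme iterator: a base code point plus any
    following combining marks. Illustration only - UAX #29 covers
    many more cases (Hangul jamo, ZWJ sequences, etc.)."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

s = "e\u0301toile"  # "étoile" with a decomposed é
print(len(list(to_code_points(s))))  # 7 code points
print(len(list(to_graphemes(s))))    # 6 graphemes
```

Client code that cares picks one view explicitly; client code that doesn't care calls the plain string functions and gets whatever compromise the library currently implements.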

All you need to do for this is document it, and try to make sure the string APIs don't force the implementation to work with code points.


Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, toCodeUnits; this way your intentions are clear

Breaks a lot of existing code.

Won't fly with Walter unless it solves world hunger. Nevertheless I have to say I like it; the #1 thing I'd change about the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit.

No, it doesn't break anything. This is just the continuation of what I tried to explain above: if you want to be sure you're working with graphemes or dchar, say it.

Also, it says nothing about iteration or foreach, so I'm not sure why it wouldn't fly with Walter. Those can stay as they are, except for one thing: you and Walter should really get on the same wavelength regarding ElementType!(char[]) and foreach(c; string). I don't care that much which is the default, but they absolutely need to be the same.


Todo 3: string-specific algorithms can be implemented as simple wrappers around generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you should restrict the algorithms that work out of the box to those which can also work with graphemes. This way you can change the behaviour in the future and support graphemes by a simple upgrade of Phobos.

Algorithms that don't work with graphemes would still work with toDchar.

So what doesn't work with graphemes? Predicates such as "a < b" for instance. That's pretty much it.


This will have trouble passing Walter's wanking test. Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but we implement it using dchar for the time being. Some function signatures might have to differ a bit.


Do you like that more?

This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance.

Well then, don't you find it balanced enough? I'm not asking that everything be done with graphemes. I'm not even asking that anything be done with graphemes by default. I'm only asking that we keep the API clean enough that we can switch to graphemes by default in the future without having to rewrite all the code everywhere to use byGrapheme. Isn't that the right balance?


--
Michel Fortin
[email protected]
http://michelf.com/
