On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu <[email protected]> said:

On 1/17/11 10:34 AM, Michel Fortin wrote:
As I said: all those people who are not validating the inputs to make
sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text contains combining code points.

Not much, right now.

The problem is that the answer to this question is likely to change as Unicode support improves in operating systems and applications. Shouldn't we future-proof Phobos?
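To make the problem concrete, here is a minimal sketch (in Python for illustration; D's string types behave analogously at the code-point level) of how one visible character can be spelled with or without a combining code point:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'é' as 'e' plus combining acute accent (U+0301)

# Both render identically, yet naive code-point operations disagree:
print(len(precomposed))            # 1
print(len(decomposed))             # 2
print(precomposed == decomposed)   # False

# Only after normalization do the two spellings compare equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Code that validates or compares strings without normalizing (or without grapheme awareness) silently treats these two inputs as different.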


It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.


Even NSString exposes the underlying UTF-16 encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the cases where graphemes matter), you need font information. And - surprise! - some fonts do NOT support combining characters and print the marks next to one another instead of combining them, so the "wrong" method of counting characters is more informative.

Generally what OS X does in those cases is display that character in another font. That said, counting graphemes is never a good way to tell how much space some text will take (unless the application enforces a fixed width per grapheme). It's more useful for telling the number of characters in a text document, similar to a word count.


That said, no one should really have to care but those who implement the
string manipulation functions. The idea behind making the grapheme the
element type is to make it easier to write grapheme-aware string
manipulation functions, even if you don't know about graphemes. But the
reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.

I'm not sure what you mean by that.


- - -

I gave some thought to all this, and came to an interesting realization that made me refine the proposal. The new proposal is perhaps as disruptive as the first, but in a different way.

But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes,
and what is a 'character'
Fact 3: most people won't bother with all this; they'll just use the basic language facilities and assume everything works correctly if it works correctly for them

Nice :o).

Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.


To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".

It's not just D users, it's also the users of programs written by D users. I can't count how many times I've seen accented characters mishandled on websites and elsewhere, and I have an aversion to doing the same thing to people of other cultures and languages.

If the operating system supports combining marks, users expect that applications running on it will deal with them correctly too, and they'll (rightfully) blame your application if it doesn't. Same for websites.

I understand that in some situations you don't want to deal with graphemes even if you theoretically should, but I don't think it should be the default.


One observation I made with having dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? Because the only way to represent code point 'a' in UTF-8 is with the single code unit 'a'.

Right. That's why many algorithms in std are specialized for such cases.

And guess what? Almost the same optimization can apply to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check whether that code unit is followed by a combining mark code point to confirm it is really an 'a', not part of a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot.
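The optimization described above can be sketched as follows - in Python for illustration (a Phobos implementation would work on `immutable(char)[]` directly), with `unicodedata` standing in for the Unicode tables, and `find_grapheme_a` a hypothetical name:

```python
import unicodedata

def find_grapheme_a(data: bytes) -> int:
    """Byte-level search for the grapheme 'a' in UTF-8 data.
    In UTF-8 the byte 0x61 can only ever encode code point 'a'
    (multi-byte sequences never contain ASCII bytes), so the search
    itself needs no decoding; we decode only around a hit to confirm
    the 'a' is not followed by a combining mark."""
    start = 0
    while True:
        i = data.find(b"a", start)
        if i == -1:
            return -1
        # data[i] is a whole code point, so data[i+1:] starts on a
        # code point boundary. (A real implementation would decode
        # just one code point here, not the whole tail.)
        tail = data[i + 1:].decode("utf-8")
        if not tail or unicodedata.combining(tail[0]) == 0:
            return i       # a plain 'a', not part of a composed grapheme
        start = i + 1      # 'a' + combining mark: skip this false match

print(find_grapheme_a("la ville".encode("utf-8")))    # 1
print(find_grapheme_a("a\u0301 seul".encode("utf-8")))  # -1 (only a composed grapheme)
```

The same byte-level strategy works for any needle that is a single ASCII character, which is exactly the common case the text describes.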

Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 haystack. I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding graphemes to confirm it's really "étoilé"
2. search for a code point matching 'é' or 'e', then confirm that the code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a regular expression operating at code-unit level. With that you can search efficiently for the needle directly in code units without having to decode and/or normalize the whole haystack.
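As a rough sketch of that second approach - here in Python, matching on code points rather than raw UTF-8 code units, with a hypothetical `needle_pattern` helper; it assumes each composed character decomposes to a base character plus combining marks:

```python
import re
import unicodedata

def needle_pattern(needle: str) -> "re.Pattern":
    """Build a regex matching both the composed (NFC) and decomposed
    (NFD) spellings of the needle, so the haystack itself never needs
    to be normalized before searching."""
    parts = []
    for ch in unicodedata.normalize("NFC", needle):
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1:
            # Composed character: accept either spelling at this position.
            parts.append("(?:%s|%s)" % (re.escape(ch), re.escape(nfd)))
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

pat = needle_pattern("\u00e9toil\u00e9")  # "étoilé"
print(bool(pat.search("une \u00e9toil\u00e9 brillante")))                    # True (NFC haystack)
print(bool(pat.search(unicodedata.normalize("NFD", "l'\u00e9toil\u00e9"))))  # True (NFD haystack)
```

Note this sketch can still match a prefix of a larger grapheme (e.g. 'é' followed by yet another combining mark), so a real implementation would add the surrounding-grapheme confirmation step described above.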


This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited for strings because you always need to fully decode the string (into dchars or graphemes) just to call the predicate. Given that comparing characters for anything other than equality or membership in a set is very rarely done, generic algorithms miss a big optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable predicates? You can't implement the above optimization if you can't guarantee the predicate is "==". That said, perhaps we can detect "==" and apply the optimization only then.

Being able to specify the predicate doesn't gain you much for strings, because a < 'a' doesn't make much sense. All you need to check for is equality with some value or membership in a given character set, both of which can use the optimization above.


So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific
Unicode-aware algorithms should be used instead; they can share the same
name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not have made it very clear either).

The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on. By making it an implementation detail, you can later make grapheme-aware algorithms the default without changing the API. Since you're the gatekeeper of Phobos, you can make this change conditional on getting an acceptable level of performance out of the grapheme-aware algorithms, or on other factors like the amount of combining characters you encounter in the wild in the next few years.

So the general string functions would implement your compromise (using dchar) but not commit to it indefinitely. Someone who really wants to work with code points can use toDchar, someone who wants to deal with graphemes uses toGraphemes, and someone who doesn't care chooses nothing and gets the default compromise behaviour.
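To illustrate what "dressing" a string explicitly could look like - a rough Python sketch with hypothetical `to_code_points`/`to_graphemes` helpers standing in for toDchar/toGraphemes; real grapheme segmentation (Unicode UAX #29) has more rules than just trailing combining marks:

```python
import unicodedata

def to_code_points(s):
    """Iterate by code point (what Python's str does natively;
    analogous to toDchar)."""
    return iter(s)

def to_graphemes(s):
    """Very rough grapheme iterator: a base code point plus any
    following combining marks. Illustration only - UAX #29 covers
    many more cases (Hangul jamo, ZWJ sequences, etc.)."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

s = "e\u0301toile"  # "étoile" with a decomposed é
print(len(list(to_code_points(s))))  # 7 code points
print(len(list(to_graphemes(s))))    # 6 graphemes
```

Client code that cares picks one view explicitly; client code that doesn't care calls the plain string functions and gets whatever compromise the library currently implements.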

All you need to do for this is document it, and try to make sure the string APIs don't force the implementation to work with code points.


Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, toCodeUnits; this way your intentions are clear

Breaks a lot of existing code.

Won't fly with Walter unless it solves world hunger. Nevertheless I have to say I like it; the #1 thing I'd change about the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit.

No, it doesn't break anything. This is just the continuation of what I tried to explain above: if you want to be sure you're working with graphemes or dchar, say it.

Also, it says nothing about iteration or foreach, so I'm not sure why it wouldn't fly with Walter. Those can stay as they are, except for one thing: you and Walter should really get on the same wavelength regarding ElementType!(char[]) and foreach(c; string). I don't care that much which is the default, but they absolutely need to be the same.


Todo 3: string-specific algorithms can be implemented as simple wrappers around generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you should restrict the algorithms that work out of the box to those which can also work with graphemes. This way you can change the behaviour in the future and support graphemes by a simple upgrade of Phobos.

Algorithms that don't work with graphemes would still work with toDchar.

So what doesn't work with graphemes? Predicates such as "a < b" for instance. That's pretty much it.


This will have trouble passing Walter's wanking test. Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but we implement it using dchar for the time being. Some function signatures might have to differ a bit.


Do you like that more?

This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance.

Well then, don't you find it balanced enough? I'm not asking that everything be done with graphemes. I'm not even asking that anything be done with graphemes by default. I'm only asking that we keep the API clean enough that we can switch to graphemes by default in the future without having to rewrite all the code everywhere to use byGrapheme. Isn't that the right balance?


--
Michel Fortin
[email protected]
http://michelf.com/
