On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu <[email protected]> said:

On 12/31/11 2:04 AM, Walter Bright wrote:

We're chasing phantoms here, and I worry a lot about over-engineering
trivia.

I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.

Perfect? At one time Java and other frameworks started treating UTF-16 code units as if they were characters, and that turned out to be wrong. Now we know that not even code points should be considered characters, because a single character can span multiple code points. You might call it perfect, but to do so you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situation.
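
To make the first assumption concrete, here's a minimal sketch in D (my own illustrative example, not something from this thread; it relies on Phobos treating char[] as a range of dchar). A single user-perceived character can be several code points, and several more code units:

    import std.range : walkLength;
    import std.stdio : writeln;

    void main()
    {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
        // character, two code points, three UTF-8 code units.
        string s = "e\u0301";

        writeln(s.length);      // 3 -- UTF-8 code units
        writeln(s.walkLength);  // 2 -- code points, what a range of dchar sees
        // Neither count matches the single "character" the user sees.
    }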

The whole concept of generic algorithms working efficiently on strings doesn't hold. Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters), and it should be avoided. Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly, read each character at the level of abstraction required, and know what you're doing.
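
For instance, here's roughly what I mean (a minimal sketch, the helper name is mine): when searching for an ASCII delimiter you can scan the code units directly and never decode, because in UTF-8 every code unit of a multi-byte sequence is 0x80 or above and can never be mistaken for an ASCII character.

    // Sketch: find an ASCII delimiter in UTF-8 text without decoding anything.
    size_t findDelimiter(const(char)[] input, char delim)
    {
        assert(delim < 0x80);           // only valid for ASCII delimiters
        foreach (i, char c; input)      // iterates code units, no decoding
        {
            if (c == delim)
                return i;
        }
        return input.length;
    }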

I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own. But I doubt the current approach of using .raw alone will prevent many people from doing dumb things. On the other hand, I'm sure it'll make it more complicated to write Unicode algorithms, because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win.

As for Walter being the only one who codes by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before; I think it's because I know how Unicode works and I want to make my parsers efficient. I did the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (besides Walter and you). I think this is how efficient algorithms dealing with Unicode should be written.
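
To illustrate the pattern (a sketch of the general idea, not the XML parser I posted; the function name is made up): take the ASCII fast path on plain code units and call std.utf.decode only when a multi-byte sequence shows up.

    import std.uni : isAlpha;
    import std.utf : decode;

    // Sketch: count alphabetic characters with an ASCII fast path,
    // decoding a code point only when a non-ASCII lead byte appears.
    size_t countAlpha(const(char)[] input)
    {
        size_t count, i;
        while (i < input.length)
        {
            char c = input[i];
            if (c < 0x80)                    // ASCII: one code unit, no decoding
            {
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                    ++count;
                ++i;
            }
            else                             // non-ASCII: decode the code point
            {
                dchar cp = decode(input, i); // advances i past the sequence
                if (isAlpha(cp))
                    ++count;
            }
        }
        return count;
    }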

--
Michel Fortin
[email protected]
http://michelf.com/
