On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu <[email protected]> said:

On 12/31/11 2:04 AM, Walter Bright wrote:

We're chasing phantoms here, and I worry a lot about over-engineering
trivia.

I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.

Perfect? At one time Java and other frameworks started treating UTF-16 code units as if they were characters, and that turned out to be wrong. Now we know that not even code points should be considered characters, because a single character can span multiple code points. You might call it perfect, but to do so you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situation.
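
To make the first assumption concrete, here's a minimal sketch in D (my own illustrative example, not something from this thread; it relies on Phobos treating char[] as a range of dchar). A single user-perceived character can be several code points, and several more code units:

    import std.range : walkLength;
    import std.stdio : writeln;

    void main()
    {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
        // character, two code points, three UTF-8 code units.
        string s = "e\u0301";

        writeln(s.length);      // 3 -- UTF-8 code units
        writeln(s.walkLength);  // 2 -- code points, what a range of dchar sees
        // Neither count matches the single "character" the user sees.
    }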

The whole concept of generic algorithms working efficiently on strings doesn't hold. Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters), and it should be avoided. Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly, read each character at the level of abstraction required, and know what you're doing.
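
For instance, here's roughly what I mean (a minimal sketch, the helper name is mine): when searching for an ASCII delimiter you can scan the code units directly and never decode, because in UTF-8 every code unit of a multi-byte sequence is 0x80 or above and can never be mistaken for an ASCII character.

    // Sketch: find an ASCII delimiter in UTF-8 text without decoding anything.
    size_t findDelimiter(const(char)[] input, char delim)
    {
        assert(delim < 0x80);           // only valid for ASCII delimiters
        foreach (i, char c; input)      // iterates code units, no decoding
        {
            if (c == delim)
                return i;
        }
        return input.length;
    }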

I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own. But I doubt the current approach of using .raw alone will prevent many people from doing dumb things. On the other hand, I'm sure it'll make it more complicated to write Unicode algorithms, because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win.

As for Walter being the only one who codes by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before; I think it's because I know how Unicode works and I want to make my parsers efficient. I did the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (besides Walter and you). I think this is how efficient algorithms dealing with Unicode should be written.
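
To illustrate the pattern (a sketch of the general idea, not the XML parser I posted; the function name is made up): take the ASCII fast path on plain code units and call std.utf.decode only when a multi-byte sequence shows up.

    import std.uni : isAlpha;
    import std.utf : decode;

    // Sketch: count alphabetic characters with an ASCII fast path,
    // decoding a code point only when a non-ASCII lead byte appears.
    size_t countAlpha(const(char)[] input)
    {
        size_t count, i;
        while (i < input.length)
        {
            char c = input[i];
            if (c < 0x80)                    // ASCII: one code unit, no decoding
            {
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                    ++count;
                ++i;
            }
            else                             // non-ASCII: decode the code point
            {
                dchar cp = decode(input, i); // advances i past the sequence
                if (isAlpha(cp))
                    ++count;
            }
        }
        return count;
    }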

--
Michel Fortin
[email protected]
http://michelf.com/
