On 23/01/2012 2:43 AM, Masklinn wrote:
> On 2012-01-23, at 05:37 , Kevin Cantu wrote:
>> I'm curious though, because I've not used it in depth, what makes NSString
>> so good.  What does it do that Haskell's Text and other languages' string
>> types don't do?
> First-class interaction with grapheme clusters (which it calls "composed
> characters")[0], I don't remember seeing that in any other language, and good
> first-class (not tucked away in a library, hidden out of the way) support for
> Unicode text manipulation algorithms (lower and upper case conversions,
> sorting, etc.)
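To make the "grapheme cluster" distinction concrete, here is a small sketch in today's Rust, whose standard library exposes codepoints but not grapheme clusters (grapheme iteration lives in an external crate, unicode-segmentation, which is exactly the "tucked away in a library" situation being contrasted with NSString):

```rust
fn main() {
    // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character (grapheme cluster), two codepoints.
    let decomposed = "e\u{301}";
    assert_eq!(decomposed.chars().count(), 2); // codepoints
    assert_eq!(decomposed.len(), 3);           // UTF-8 bytes
    // std stops at codepoints; there is no grapheme iterator here.
    println!("chars = {}, bytes = {}", decomposed.chars().count(), decomposed.len());
}
```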

You're asking for a locale-qualified composed-character type. That's more than a string type. That's much higher up the ladder towards UI.

Str is more like int or float, or say one of those nice date-time types like TAI64. Not like a UI object representing "number" or "text" or "calendar date". Those UI concepts contain buckets of affordances that have no relevance -- just cost and limitation -- to most instances of the datatype that occur lower down in the program.

Use of strings for display-to-humans-as-UI is actually a pretty narrow subset of all the strings your average program works with. Making such UI-level concepts the sole representation might make sense in a world where:

  - Performance doesn't matter. Space, time, etc.
  - Programmers never want to think of the thing in terms of its
    lower-level representation.

But IMO that's not the general condition.

>> And what do you need from a core string library that
>> doesn't belong in, say, an extended package of ICU bindings?
>
> As far as I am concerned, any string operation which is defined in Unicode
> should be either implemented "correctly" (according to unicode) on the string
> type or not at all (and delegated to a third-party library).

I agree with this. If it's a "string operation defined in unicode", we intend to ship the real thing. The things in libcore that claim to be unicode-y are correct unicode algorithms. We just don't ship *all* and *only* unicode algorithms in libcore; some will be pushed to libstd or further (punting to libICU) if they seem rare. Unicode is a huge standard.
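As an illustration of "the real thing" (using today's Rust names; in current Rust the case-mapping methods live on str in std rather than libcore): case conversion ships as the correct Unicode algorithm, not an ASCII approximation.

```rust
fn main() {
    // Unicode-correct case mapping: U+00DF ("ß") uppercases to the
    // two-character sequence "SS", as the Unicode tables specify.
    assert_eq!("straße".to_uppercase(), "STRASSE");
    // The explicitly ASCII-only variant leaves "ß" alone:
    assert_eq!("straße".to_ascii_uppercase(), "STRAßE");
}
```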

Moreover, we do (and will continue to) ship algorithms on str values that are not defined by unicode. More on this next...

> This means, for instance, either string comparisons should implement the UCA
> or they should be forbidden.

I disagree with this.

In particular, I disagree with the idea that the only comparison that exists is UCA comparison. UCA comparison is not even a single operation: it's a *family* of highly customizable operations. The UCA documentation itself clearly lists a number of serious shortcomings:

  - It's very slow.
  - It is not a stable sort.
  - It is not preserved under concatenation or substring.
  - It's highly variable: results vary by locale, legal system,
    organizational tailorings, phonetic dictionaries, etc.
  - It will disagree with any tool that sorts by codepoint or byte order.
  - It is subject to revision by the unicode consortium, so the two ends
    of a communication medium (say) may be running wildly different
    revisions of it.

All this is not to disparage the fine work done by the consortium. UCA is a massive work of linguistic engineering. It's also inappropriate for jamming into the middle of all uses of strings. The authors are quite clear on that:

  "The Unicode Collation Algorithm does not restrict the many different
   ways in which implementations can compare strings"

Most uses of strings involve computers talking to other computers, or to themselves, not humans via a GUI. And most of those operations are more like:

  - Bulk IO.
  - Use as keys in hashtables or balanced trees.
  - Substring and concatenation operations.

These operations do, regularly, have use for a "<" operator that does something a lot less than any particular locale-customization of UCA. Namely: memcmp. So that's what we do for <. It's a different operation than any UCA operator. It has a different API. A proper API for a UCA operation isn't even *expressible* as a < expression, since as the spec states "collation is not a property of strings". It has to be tailored by locale and a dozen other features of the scripts.
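A quick sketch of what that memcmp-flavored `<` buys and costs, written in today's Rust (whose str comparison is byte order, as described here):

```rust
use std::collections::BTreeMap;

fn main() {
    // str's `<` is plain byte order, not collation: every uppercase
    // ASCII letter sorts before every lowercase one.
    assert!("Zebra" < "apple"); // 'Z' = 0x5A < 'a' = 0x61
    // A BTreeMap keyed on String inherits that order -- fast, stable,
    // and identical on both ends of a wire, but not what a human
    // expects in a locale-aware sorted list.
    let mut m = BTreeMap::new();
    for w in ["apple", "Zebra", "mango"] {
        m.insert(w.to_string(), ());
    }
    let keys: Vec<&str> = m.keys().map(|k| k.as_str()).collect();
    assert_eq!(keys, ["Zebra", "apple", "mango"]);
}
```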

Similarly, demanding the sole representation for our strings be in precomposed-grapheme form means that all bulk IO on strings takes not just a codepoint conversion hit, but a normalization-pass hit. That's a very high cost and there's no reason to assume "most" uses of strings require it.
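The normalization point can be seen directly: two canonically equivalent spellings of "é" compare unequal under bytewise `==`, because no hidden normalization pass runs on every comparison (callers who need canonical equivalence normalize explicitly, e.g. via the unicode-normalization crate). A sketch in today's Rust:

```rust
fn main() {
    let precomposed = "\u{e9}";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    let decomposed = "e\u{301}"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    // Bytewise equality: canonically equivalent, but not equal bytes.
    assert_ne!(precomposed, decomposed);
    assert_eq!(precomposed.len(), 2); // NFC form: 2 UTF-8 bytes
    assert_eq!(decomposed.len(), 3);  // NFD form: 3 UTF-8 bytes
}
```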

(Plus one can't even implement either of these things without shipping a program with an 18mb library. Again, this only makes sense in a world where performance costs are somehow invisible or not counted.)

> This, of course, does not apply to a bytes/[u8] type, which would operate
> solely at the byte level.

In Rust, str is not [u8]. Str is unicode; [u8] is a step further down. Str is just held in the most common and future-proof unicode encoding, to avoid constant round-tripping through different encodings and normalizations during bulk IO, and to grant default operations like < and == based on performance and commonality estimates and our own experience writing code that uses strings.
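The relationship is observable at the boundary (shown here with today's std names): going from [u8] up to str is a UTF-8 validity check, with no transcoding or normalization, and going back down is free.

```rust
fn main() {
    // [u8] -> &str must pass UTF-8 validation; nothing more.
    let bytes: &[u8] = &[0x68, 0x69]; // "hi"
    assert_eq!(std::str::from_utf8(bytes), Ok("hi"));
    // Arbitrary bytes are not a str:
    assert!(std::str::from_utf8(&[0xff, 0xfe]).is_err());
    // str -> [u8] is just a view of the UTF-8 bytes.
    assert_eq!("hi".as_bytes(), bytes);
}
```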

I'm willing to listen to arguments about "commonality" to some extent, but I'd be very surprised if your position is that most programs you've worked on would benefit from (say) their balanced trees implementing DUCET rather than memcmp. I think many of them would just break, and the remainder would slow down by a factor of 100.

Str is, in other words, a point of tension between many forces, like int and float. One of those forces is definitely "be unicode" -- we have no intention of *burying* unicode-ness -- but performance, commonality, simplicity and compatibility are also concerns.

-Graydon
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev