On 23/01/2012 2:43 AM, Masklinn wrote:
> On 2012-01-23, at 05:37 , Kevin Cantu wrote:
>> I'm curious though, because I've not used it in depth, what makes NSString
>> so good.  What does it do that Haskell's Text and other languages' string
>> types don't do?
> First-class interaction with grapheme clusters (which it calls "composed
> characters")[0], I don't remember seeing that in any other language, and good
> first-class (not tucked away in a library, hidden out of the way) support for
> Unicode text manipulation algorithms (lower and upper case conversions,
> sorting, etc.)
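To make the "grapheme cluster" distinction concrete, here is a small sketch in today's Rust, whose standard library exposes codepoints but not grapheme clusters (grapheme iteration lives in an external crate, unicode-segmentation, which is exactly the "tucked away in a library" situation being contrasted with NSString):

```rust
fn main() {
    // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character (grapheme cluster), two codepoints.
    let decomposed = "e\u{301}";
    assert_eq!(decomposed.chars().count(), 2); // codepoints
    assert_eq!(decomposed.len(), 3);           // UTF-8 bytes
    // std stops at codepoints; there is no grapheme iterator here.
    println!("chars = {}, bytes = {}", decomposed.chars().count(), decomposed.len());
}
```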

You're asking for a locale-qualified composed-character type. That's more than a string type. That's much higher up the ladder towards UI.

Str is more like int or float, or say one of those nice date-time types like TAI64. Not like a UI object representing "number" or "text" or "calendar date". Those UI concepts contain buckets of affordances that have no relevance -- just cost and limitation -- to most instances of the datatype that occur lower down in the program.

Use of strings for display-to-humans-as-UI is actually a pretty narrow subset of all the strings your average program works with. Making such UI-level concepts the sole representation might make sense in a world where:

  - Performance doesn't matter. Space, time, etc.
  - Programmers never want to think of the thing in terms of its
    lower-level representation.

But IMO that's not the general condition.

>> And what do you need from a core string library that
>> doesn't belong in, say, an extended package of ICU bindings?
>
> As far as I am concerned, any string operation which is defined in Unicode
> should be either implemented "correctly" (according to unicode) on the string
> type or not at all (and delegated to a third-party library).

I agree with this. If it's a "string operation defined in unicode", we intend to ship the real thing. The things in libcore that claim to be unicode-y are correct unicode algorithms. We just don't ship *all* and *only* unicode algorithms in libcore; some will be pushed to libstd or further (punting to libICU) if they seem rare. Unicode is a huge standard.
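As an illustration of "the real thing" (using today's Rust names; in current Rust the case-mapping methods live on str in std rather than libcore): case conversion ships as the correct Unicode algorithm, not an ASCII approximation.

```rust
fn main() {
    // Unicode-correct case mapping: U+00DF ("ß") uppercases to the
    // two-character sequence "SS", as the Unicode tables specify.
    assert_eq!("straße".to_uppercase(), "STRASSE");
    // The explicitly ASCII-only variant leaves "ß" alone:
    assert_eq!("straße".to_ascii_uppercase(), "STRAßE");
}
```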

Moreover, we do (and will continue to) ship algorithms on str values that are not defined by unicode. More on this next...

> This means, for instance, either string comparisons should implement the UCA
> or they should be forbidden.

I disagree with this.

In particular, I disagree with the idea that the only comparison that exists is UCA comparison. UCA comparison is not even a single operation: it's a *family* of highly customizable operations. The UCA documentation itself clearly lists a number of serious shortcomings:

  - It's very slow.
  - It is not a stable sort.
  - It is not preserved under concatenation or substring.
  - It's highly variable: results vary by locale, legal system,
    organizational tailorings, phonetic dictionaries, etc.
  - It will disagree with any tool that sorts by codepoint or byte order.
  - It is subject to revision by the unicode consortium, so the two ends
    of a communication medium (say) may be running wildly different
    revisions of it.

All this is not to disparage the fine work done by the consortium. UCA is a massive work of linguistic engineering. It's also inappropriate for jamming into the middle of all uses of strings. The authors are quite clear on that:

  "The Unicode Collation Algorithm does not restrict the many different
   ways in which implementations can compare strings"

Most uses of strings involve computers talking to other computers, or to themselves, not humans via a GUI. And most of those operations are more like:

  - Bulk IO.
  - Use as keys in hashtables or balanced trees.
  - Substring and concatenation operations.

These operations do, regularly, have use for a "<" operator that does something a lot less than any particular locale-customization of UCA. Namely: memcmp. So that's what we do for <. It's a different operation than any UCA operator. It has a different API. A proper API for a UCA operation isn't even *expressible* as a < expression, since as the spec states "collation is not a property of strings". It has to be tailored by locale and a dozen other features of the scripts.
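A quick sketch of what that memcmp-flavored `<` buys and costs, written in today's Rust (whose str comparison is byte order, as described here):

```rust
use std::collections::BTreeMap;

fn main() {
    // str's `<` is plain byte order, not collation: every uppercase
    // ASCII letter sorts before every lowercase one.
    assert!("Zebra" < "apple"); // 'Z' = 0x5A < 'a' = 0x61
    // A BTreeMap keyed on String inherits that order -- fast, stable,
    // and identical on both ends of a wire, but not what a human
    // expects in a locale-aware sorted list.
    let mut m = BTreeMap::new();
    for w in ["apple", "Zebra", "mango"] {
        m.insert(w.to_string(), ());
    }
    let keys: Vec<&str> = m.keys().map(|k| k.as_str()).collect();
    assert_eq!(keys, ["Zebra", "apple", "mango"]);
}
```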

Similarly, demanding the sole representation for our strings be in precomposed-grapheme form means that all bulk IO on strings takes not just a codepoint conversion hit, but a normalization-pass hit. That's a very high cost and there's no reason to assume "most" uses of strings require it.
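The normalization point can be seen directly: two canonically equivalent spellings of "é" compare unequal under bytewise `==`, because no hidden normalization pass runs on every comparison (callers who need canonical equivalence normalize explicitly, e.g. via the unicode-normalization crate). A sketch in today's Rust:

```rust
fn main() {
    let precomposed = "\u{e9}";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    let decomposed = "e\u{301}"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    // Bytewise equality: canonically equivalent, but not equal bytes.
    assert_ne!(precomposed, decomposed);
    assert_eq!(precomposed.len(), 2); // NFC form: 2 UTF-8 bytes
    assert_eq!(decomposed.len(), 3);  // NFD form: 3 UTF-8 bytes
}
```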

(Plus one can't even implement either of these things without shipping a program with an 18mb library. Again, this only makes sense in a world where performance costs are somehow invisible or not counted.)

> This, of course, does not apply to a bytes/[u8] type, which would operate
> solely at the byte level.

In Rust, str is not [u8]. Str is unicode; [u8] is a step further down. Str is just held in the most common and future-proof unicode encoding, to avoid constant round-tripping through different encodings and normalizations during bulk IO, and to grant default operations like < and == based on performance and commonality estimates and our own experience writing code that uses strings.
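The relationship is observable at the boundary (shown here with today's std names): going from [u8] up to str is a UTF-8 validity check, with no transcoding or normalization, and going back down is free.

```rust
fn main() {
    // [u8] -> &str must pass UTF-8 validation; nothing more.
    let bytes: &[u8] = &[0x68, 0x69]; // "hi"
    assert_eq!(std::str::from_utf8(bytes), Ok("hi"));
    // Arbitrary bytes are not a str:
    assert!(std::str::from_utf8(&[0xff, 0xfe]).is_err());
    // str -> [u8] is just a view of the UTF-8 bytes.
    assert_eq!("hi".as_bytes(), bytes);
}
```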

I'm willing to listen to arguments about "commonality" to some extent, but I'd be very surprised if your position is that most programs you've worked on would benefit from (say) their balanced trees implementing DUCET rather than memcmp. I think many of them would just break, and the remainder would slow down by a factor of 100.

Str is, in other words, a point of tension between many forces, like int and float. One of those forces is definitely "be unicode" -- we have no intention of *burying* unicode-ness -- but performance, commonality, simplicity and compatibility are also concerns.

-Graydon
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev