Re: [rust-dev] UTF-8 strings versus "encoded ropes"

Malthe Borch Thu, 01 May 2014 11:38:37 -0700

On Thursday, May 1, 2014, Nathan Myers <[email protected]> wrote:

> It would be a mistake for a byte sequence container, stream, or string
> type to know anything about particular encodings. An encoding is an
> interpretation imposed on a byte sequence. Users of a sequence need to be
> able to choose what interpretation to apply without interference from some
> previous user's choice, and without need to make a copy.



You can "decode" an existing rope with an explicit codec without altering
the stream. The codec is essentially metadata.

For example, decoding from 8-bit raw to UTF-8 only attaches a codec to the
existing bytes. The byte stream does not change unless you "encode" (which
really transcodes, since it flattens the rope).
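
A minimal sketch of "decode as metadata", under assumptions: the `Codec`
and `Leaf` types below are hypothetical names made up for illustration,
not part of Rust's standard library or any existing rope crate. Decoding
validates the bytes and changes the tag; the leaf keeps pointing at the
same memory.

```rust
/// Hypothetical codec tag (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Codec {
    Raw8, // 8-bit raw: no interpretation declared
    Utf8,
}

/// Hypothetical rope leaf: borrowed bytes plus a codec tag.
#[derive(Debug)]
struct Leaf<'a> {
    bytes: &'a [u8],
    codec: Codec,
}

impl<'a> Leaf<'a> {
    fn raw(bytes: &'a [u8]) -> Self {
        Leaf { bytes, codec: Codec::Raw8 }
    }

    /// "Decode": attach a codec to the existing bytes.
    /// The bytes are validated but never copied or rewritten.
    fn decode(self, codec: Codec) -> Result<Leaf<'a>, std::str::Utf8Error> {
        if codec == Codec::Utf8 {
            std::str::from_utf8(self.bytes)?; // validation only
        }
        Ok(Leaf { bytes: self.bytes, codec })
    }
}
```

Rejecting invalid input at decode time is a design choice of this sketch;
a lazier design could defer validation until the text is first read.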


> As an example, a given string may be seen as raw bytes, as a series of
> delimited records, as Unicode code points within some of those records, as
> a series of JSON name-value pairs within such a record, and as a decimal
> number in a JSON value part.  The same interpretations need to work on a
> raw byte stream that would not tolerate in-band Rust-specific annotations.


The encode operation would be free if the rope has only a single leaf and
the target codec is the same as that leaf's codec.
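
A rough sketch of that fast path, with assumptions stated up front: `Codec`,
`Leaf`, `Rope` and `encode_utf8` are hypothetical names, and 8-bit raw
bytes are assumed to map one-to-one onto code points U+0000..U+00FF when
transcoded. Returning a `Cow` lets the single-leaf case borrow the original
bytes while the general case flattens into a new buffer.

```rust
use std::borrow::Cow;

/// Hypothetical codec tag (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Codec {
    Raw8,
    Utf8,
}

/// Hypothetical rope leaf: borrowed bytes plus a codec tag.
struct Leaf<'a> {
    bytes: &'a [u8],
    codec: Codec,
}

/// Hypothetical rope: a flat list of leaves, enough for this sketch.
struct Rope<'a> {
    leaves: Vec<Leaf<'a>>,
}

impl<'a> Rope<'a> {
    /// "Encode" to UTF-8, flattening the rope.
    fn encode_utf8(&self) -> Cow<'a, [u8]> {
        // Free case: one leaf that is already UTF-8, so just borrow it.
        if let [leaf] = self.leaves.as_slice() {
            if leaf.codec == Codec::Utf8 {
                return Cow::Borrowed(leaf.bytes);
            }
        }
        // General case: transcode every leaf into one new buffer.
        let mut out = String::new();
        for leaf in &self.leaves {
            match leaf.codec {
                Codec::Utf8 => out.push_str(
                    std::str::from_utf8(leaf.bytes)
                        .expect("leaf tagged Utf8 must hold valid UTF-8"),
                ),
                // Assumption: each raw byte is the code point of the same value.
                Codec::Raw8 => out.extend(leaf.bytes.iter().map(|&b| b as char)),
            }
        }
        Cow::Owned(out.into_bytes())
    }
}
```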

> The UTF-8 view of a string is an interesting special case. Depending on
> context, what is considered a "character" may be a code point of at most 4
> bytes, or any number of bytes representing a base and combining characters
> which might or might not be collapsible to a canonical, single code point,
> or a series of such constructs that is to be displayed as a ligature such
> as "Qu" or "ffi". (Some languages are best displayed as mostly ligatures.)
>

I think it's convenient for the string to provide an encoding-aware
interface. If you have specified an encoding, you normally want to work
character by character, not byte by byte. Otherwise, just don't declare an
encoding and use 8-bit raw.
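
To illustrate, a small sketch of such an interface. The `Text` type and its
`units` method are hypothetical, and "character" is taken here to mean a
Unicode code point (Rust's `char`), which is only one of the readings
mentioned in the quoted text; combining sequences and ligatures are out of
scope.

```rust
/// Hypothetical string view: either undeclared 8-bit raw or declared UTF-8.
enum Text<'a> {
    Raw8(&'a [u8]),
    Utf8(&'a str),
}

impl<'a> Text<'a> {
    /// The units a caller iterates over: bytes when no encoding is
    /// declared, code points when the text is declared as UTF-8.
    fn units(&self) -> Vec<u32> {
        match self {
            Text::Raw8(bytes) => bytes.iter().map(|&b| b as u32).collect(),
            Text::Utf8(s) => s.chars().map(|c| c as u32).collect(),
        }
    }
}
```

The same bytes then yield a different number of units depending on the
declared encoding: an accented letter stored as two UTF-8 bytes counts as
two units in the raw view and as one unit in the UTF-8 view.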
