Re: [rust-dev] Proposed API for character encodings

Simon Sapin Fri, 20 Sep 2013 03:47:33 -0700

Le 20/09/2013 10:18, Olivier Renaud a écrit :

I really like the API you are proposing. In particular, the error handling is
close to what I was expecting from such an API.


I have some remarks, though.

Is there a reason for encoders and decoders to not be reusable ? I think it
would be reasonable to specify that they get back to their initial state once
the 'flush' method is called, or when a 'DecodeError' is returned.

I don’t have a strong opinion on that. There could be a "reset" orsimilar method, but I don’t see how this is better than just throwingthe decoder away and making a new one.

With static dispatch and the encoding known at compile-time, you canprobably have decoders on the stack so making a new one is cheap.

If the encoding is determined at run-time and you use trait objects(dynamic dispatch) for decoders, the next input might have a differentencoding so reusing decoders might not be useful either.

Is a condition raised when the order of method calls is not respected ? E.g.
if one calls 'flush' multiple times, of calls 'feed' and then 'decode' ?

Decoder::decode is a static method / associated function. It’sindependent from everything else.

Other than that, I don’t know. rust-encoding doesn’t do that. AFAIU itleaves this behavior undefined, which I think is fine. Do you think itshould be explicitly checked for?

It is not clear what is given as a parameter to the 'decoding_error'
condition. I guess it's the exact subset of byte sequence that cannot be
decoded, possibly spanning multiple 'feed' calls. Is that correct ? Is it
sufficient for variable-length encodings ?

Correct, and I think yes. It is called once every time the spec says to"run the error algorithm":


http://encoding.spec.whatwg.org/#error

I am doubtful that the encoder is just a decoder with [u8] and str swapped. A
decoder must deal with a possibly invalid sequence of bytes, while an encoder
deals with str, which is guaranteed to be a valid utf8 sequence. An encoder
must handle unmappable characters, whereas a decoder doesn't

You’re right, I cut some corners. In particular, the encoding_errorcondition can take a single (unsupported) 'char'. Other than that, the*API* is (very close to?) the same with [u8] and str swapped.

(actually, it
depends whether we consider unicode to be universal or not...).

I suggest we consider it is. (For the purpose of the WHATWG spec it is.)If Unicode is missing things, the "right" solution is to add things toUnicode.

I think it would be a good idea to make a difference between an invalid
sequence and an unmappable character. I think there should be both an
'invalid_sequence' and an 'unmappable_char' condition.

That’s the distinction between decoding_error and encoding_error, whichalready exists.

Also, the 'fatal' handler is a bit scary, based on the name I'd expect it to
result in a 'fail!'.

I’m open to other names. Maybe "abort"? The idea is that you reject theentirety of this input (including previous successful calls to .feed())

I propose this set of conditions and handlers :

// Decoder conditions
condition! {
      /// The byte sequence is not a valid input
      pub invalid_sequence : ~[u8] -> Option<~str>;
      /// The byte sequence cannot be represented in Unicode (rarely used)
      pub unmappable_bytes : ~[u8] -> Option<~str>;
}

// Encoder condition
condition! {
      /// The Unicode string cannot be represented in the target encoding
      /// (essential for single byte encodings)
      pub unmappable_str : ~str -> Option<~[u8]>;
}

I think that unmappable_bytes is not needed, and the other two shouldjust be decoding_error and encoding_error. (See above.)

/// Functions to be used with invalid_sequence::cond.trap
/// or unmappable_bytes::cond.trap
mod decoding_error_handlers {
      fn decoder_error(_: ~[u8]) -> Option<~str> { None }
      fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
      fn ascii_substitute(_: ~[u8]) -> Option<~str> { Some(~"\u001A") }
      fn ignore(_: ~[u8]) -> Option<~str> { Some(~"") }
}

/// Functions to be used with unmappable_str::cond.trap
mod encoding_error_handlers {
      fn decoder_error(_: ~str) -> Option<~[u8]> { None }
      fn ascii_substitute(_: ~str) -> Option<~[u8]> { Some(~[0x1A]) }
      fn ignore(_: ~str) -> Option<~[u8]> { Some(~[]) }
}

Not sure about this substitute/replacement duality. Maybe we can have only one
function name 'default', that would be FFFD for unicode and 1A for ascii.

I think we should only provide two handlers for each of decoding andencoding: fail/abort/error, and replace. The latter is U+FFFD(replacement character) for decoding and 0x3F (ASCII question mark) forencoding as in the WHATWG spec, per web-compatibility constraints.

In particular, "ignore" is terrible and should not be encouraged.(Depending on what you’re doing with it, it could lead to securityissues.) If you do want "ignore" or "ASCII substitute", writing a customcondition handler is easy enough.


--
Simon Sapin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev

Re: [rust-dev] Proposed API for character encodings

Reply via email to