Le 20/09/2013 10:18, Olivier Renaud a écrit :
I really like the API you are proposing. In particular, the error handling is
close to what I was expecting from such an API.
I have some remarks, though.
Is there a reason for encoders and decoders to not be reusable ? I think it
would be reasonable to specify that they get back to their initial state once
the 'flush' method is called, or when a 'DecodeError' is returned.
I don’t have a strong opinion on that. There could be a "reset" or
similar method, but I don’t see how this is better than just throwing
the decoder away and making a new one.
With static dispatch and the encoding known at compile-time, you can
probably have decoders on the stack so making a new one is cheap.
If the encoding is determined at run-time and you use trait objects
(dynamic dispatch) for decoders, the next input might have a different
encoding so reusing decoders might not be useful either.
Is a condition raised when the order of method calls is not respected ? E.g.
if one calls 'flush' multiple times, of calls 'feed' and then 'decode' ?
Decoder::decode is a static method / associated function. It’s
independent from everything else.
Other than that, I don’t know. rust-encoding doesn’t do that. AFAIU it
leaves this behavior undefined, which I think is fine. Do you think it
should be explicitly checked for?
It is not clear what is given as a parameter to the 'decoding_error'
condition. I guess it's the exact subset of byte sequence that cannot be
decoded, possibly spanning multiple 'feed' calls. Is that correct ? Is it
sufficient for variable-length encodings ?
Correct, and I think yes. It is called once every time the spec says to
"run the error algorithm":
http://encoding.spec.whatwg.org/#error
I am doubtful that the encoder is just a decoder with [u8] and str swapped. A
decoder must deal with a possibly invalid sequence of bytes, while an encoder
deals with str, which is guaranteed to be a valid utf8 sequence. An encoder
must handle unmappable characters, whereas a decoder doesn't
You’re right, I cut some corners. In particular, the encoding_error
condition can take a single (unsupported) 'char'. Other than that, the
*API* is (very close to?) the same with [u8] and str swapped.
(actually, it
depends whether we consider unicode to be universal or not...).
I suggest we consider it is. (For the purpose of the WHATWG spec it is.)
If Unicode is missing things, the "right" solution is to add things to
Unicode.
I think it would be a good idea to make a difference between an invalid
sequence and an unmappable character. I think there should be both an
'invalid_sequence' and an 'unmappable_char' condition.
That’s the distinction between decoding_error and encoding_error, which
already exists.
Also, the 'fatal' handler is a bit scary, based on the name I'd expect it to
result in a 'fail!'.
I’m open to other names. Maybe "abort"? The idea is that you reject the
entirety of this input (including previous successful calls to .feed())
I propose this set of conditions and handlers :
// Decoder conditions
condition! {
/// The byte sequence is not a valid input
pub invalid_sequence : ~[u8] -> Option<~str>;
/// The byte sequence cannot be represented in Unicode (rarely used)
pub unmappable_bytes : ~[u8] -> Option<~str>;
}
// Encoder condition
condition! {
/// The Unicode string cannot be represented in the target encoding
/// (essential for single byte encodings)
pub unmappable_str : ~str -> Option<~[u8]>;
}
I think that unmappable_bytes is not needed, and the other two should
just be decoding_error and encoding_error. (See above.)
/// Functions to be used with invalid_sequence::cond.trap
/// or unmappable_bytes::cond.trap
mod decoding_error_handlers {
fn decoder_error(_: ~[u8]) -> Option<~str> { None }
fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
fn ascii_substitute(_: ~[u8]) -> Option<~str> { Some(~"\u001A") }
fn ignore(_: ~[u8]) -> Option<~str> { Some(~"") }
}
/// Functions to be used with unmappable_str::cond.trap
mod encoding_error_handlers {
fn decoder_error(_: ~str) -> Option<~[u8]> { None }
fn ascii_substitute(_: ~str) -> Option<~[u8]> { Some(~[0x1A]) }
fn ignore(_: ~str) -> Option<~[u8]> { Some(~[]) }
}
Not sure about this substitute/replacement duality. Maybe we can have only one
function name 'default', that would be FFFD for unicode and 1A for ascii.
I think we should only provide two handlers for each of decoding and
encoding: fail/abort/error, and replace. The latter is U+FFFD
(replacement character) for decoding and 0x3F (ASCII question mark) for
encoding as in the WHATWG spec, per web-compatibility constraints.
In particular, "ignore" is terrible and should not be encouraged.
(Depending on what you’re doing with it, it could lead to security
issues.) If you do want "ignore" or "ASCII substitute", writing a custom
condition handler is easy enough.
--
Simon Sapin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev