Re: [rust-dev] Proposed API for character encodings

Olivier Renaud Fri, 20 Sep 2013 02:19:33 -0700

I really like the API you are proposing. In particular, the error handling is 
close to what I was expecting from such an API.


I have some remarks, though.

Is there a reason for encoders and decoders not to be reusable? I think it
would be reasonable to specify that they return to their initial state once
the 'flush' method is called, or when a 'DecodeError' is returned.

Is a condition raised when the order of method calls is not respected? E.g.
if one calls 'flush' multiple times, or calls 'feed' and then 'decode'?

It is not clear what is given as a parameter to the 'decoding_error'
condition. I guess it's the exact run of bytes that cannot be decoded,
possibly spanning multiple 'feed' calls. Is that correct? Is it
sufficient for variable-length encodings?
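
To make the two questions above concrete, here is a minimal sketch of a
reusable streaming decoder. It is written in present-day Rust rather than the
2013 dialect, and it is not the proposed API: the names 'Utf8Decoder' and
'pending', the 'bytes' field of 'DecodeError', and the choice to drop the rest
of a chunk after an error are all assumptions made for illustration.

```rust
/// Hypothetical error type: carries the exact bytes that could not be decoded.
#[derive(Debug, PartialEq)]
pub struct DecodeError {
    pub bytes: Vec<u8>,
}

/// Toy streaming UTF-8 decoder (illustrative only, not the proposed API).
#[derive(Default)]
pub struct Utf8Decoder {
    /// Bytes of a sequence left incomplete by earlier `feed` calls.
    pending: Vec<u8>,
}

impl Utf8Decoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Decode `input`, appending the decoded text to `out`.
    pub fn feed(&mut self, input: &[u8], out: &mut String) -> Result<(), DecodeError> {
        self.pending.extend_from_slice(input);
        // Taking the buffer leaves `pending` empty, i.e. in the initial state.
        let buf = std::mem::take(&mut self.pending);
        match std::str::from_utf8(&buf) {
            Ok(s) => {
                out.push_str(s);
                Ok(())
            }
            Err(e) => {
                let valid = e.valid_up_to();
                out.push_str(std::str::from_utf8(&buf[..valid]).unwrap());
                match e.error_len() {
                    // The input stops in the middle of a sequence: keep the
                    // tail so that the next `feed` call can complete it.
                    None => {
                        self.pending = buf[valid..].to_vec();
                        Ok(())
                    }
                    // Definitely invalid: report exactly those bytes. The
                    // rest of the chunk is dropped (an assumption of this
                    // sketch) and the decoder is back in its initial state.
                    Some(n) => Err(DecodeError {
                        bytes: buf[valid..valid + n].to_vec(),
                    }),
                }
            }
        }
    }

    /// Signal the end of input; a dangling partial sequence is an error.
    /// Calling `flush` again afterwards is harmless and returns `Ok(())`.
    pub fn flush(&mut self) -> Result<(), DecodeError> {
        let rest = std::mem::take(&mut self.pending);
        if rest.is_empty() {
            Ok(())
        } else {
            Err(DecodeError { bytes: rest })
        }
    }
}
```

In this sketch the error payload can contain bytes that arrived in different
'feed' calls, which is why the decoder has to buffer them itself.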

I am doubtful that the encoder is just a decoder with [u8] and str swapped. A
decoder must deal with a possibly invalid sequence of bytes, while an encoder
deals with str, which is guaranteed to be a valid UTF-8 sequence. An encoder
must handle unmappable characters, whereas a decoder doesn't (actually, it
depends on whether we consider Unicode to be universal or not...).
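
The asymmetry can be illustrated with a small encoder sketch in present-day
Rust. The input is a str, so it is always well-formed; the only possible
failure is a character with no counterpart in the target encoding. The
function name 'encode_latin1' and the closure-based handler are assumptions
for illustration, standing in for the condition mechanism.

```rust
/// Encode `input` as ISO-8859-1, which covers exactly U+0000..=U+00FF.
/// `on_unmappable` is a hypothetical handler: it returns replacement bytes,
/// or `None` to abort the whole encoding.
pub fn encode_latin1<F>(input: &str, mut on_unmappable: F) -> Option<Vec<u8>>
where
    F: FnMut(char) -> Option<Vec<u8>>,
{
    let mut out = Vec::with_capacity(input.len());
    for c in input.chars() {
        let cp = c as u32;
        if cp <= 0xFF {
            // Code points up to U+00FF map to the byte of the same value.
            out.push(cp as u8);
        } else {
            // There is no "invalid sequence" case here, only "unmappable".
            out.extend(on_unmappable(c)?);
        }
    }
    Some(out)
}
```

For example, encode_latin1("a€b", |_| Some(vec![0x1A])) substitutes a single
byte for the euro sign, while passing |_| None makes the call return None.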

I think it would be a good idea to distinguish between an invalid
sequence and an unmappable character. I think there should be both an
'invalid_sequence' and an 'unmappable_char' condition.

Also, the 'fatal' handler is a bit scary: based on the name, I'd expect it to
result in a 'fail!'.

I propose this set of conditions and handlers:

// Decoder conditions
condition! {
     /// The byte sequence is not a valid input
     pub invalid_sequence : ~[u8] -> Option<~str>;
     /// The byte sequence cannot be represented in Unicode (rarely used)
     pub unmappable_bytes : ~[u8] -> Option<~str>;
}

// Encoder condition
condition! {
     /// The Unicode string cannot be represented in the target encoding
     /// (essential for single byte encodings)
     pub unmappable_str : ~str -> Option<~[u8]>;
}

/// Functions to be used with invalid_sequence::cond.trap
/// or unmappable_bytes::cond.trap
mod decoding_error_handlers {
     fn decoder_error(_: ~[u8]) -> Option<~str> { None }
     fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
     fn ascii_substitute(_: ~[u8]) -> Option<~str> { Some(~"\u001A") }
     fn ignore(_: ~[u8]) -> Option<~str> { Some(~"") }
}

/// Functions to be used with unmappable_str::cond.trap
mod encoding_error_handlers {
     fn encoder_error(_: ~str) -> Option<~[u8]> { None }
     fn ascii_substitute(_: ~str) -> Option<~[u8]> { Some(~[0x1A]) }
     fn ignore(_: ~str) -> Option<~[u8]> { Some(~[]) }
}
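
As a rough illustration of how the decoding handlers would be consumed, here
is a sketch in present-day Rust, where plain function pointers stand in for
condition handlers. The driver 'decode_utf8' is hypothetical, and treating a
truncated trailing sequence as one invalid run is an assumption of the sketch.

```rust
/// Modern-Rust counterparts of the decoding handlers proposed above.
pub mod decoding_error_handlers {
    /// Abort decoding.
    pub fn decoder_error(_: &[u8]) -> Option<String> { None }
    /// Substitute U+FFFD REPLACEMENT CHARACTER.
    pub fn replacement(_: &[u8]) -> Option<String> { Some("\u{FFFD}".to_string()) }
    /// Substitute U+001A SUBSTITUTE.
    pub fn ascii_substitute(_: &[u8]) -> Option<String> { Some("\u{1A}".to_string()) }
    /// Drop the offending bytes.
    pub fn ignore(_: &[u8]) -> Option<String> { Some(String::new()) }
}

/// Hypothetical driver: decode UTF-8, calling `handler` on each invalid run.
/// Returns `None` as soon as the handler gives up.
pub fn decode_utf8(mut input: &[u8], handler: fn(&[u8]) -> Option<String>) -> Option<String> {
    let mut out = String::new();
    loop {
        match std::str::from_utf8(input) {
            Ok(s) => {
                out.push_str(s);
                return Some(out);
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                // A truncated trailing sequence has no known length, so the
                // whole remainder is handed to the handler in that case.
                let bad_len = e.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                out.push_str(&handler(bad)?);
                input = tail;
            }
        }
    }
}
```

A call such as decode_utf8(b"a\xFFb", decoding_error_handlers::ignore) then
plays the role of trapping the condition with the 'ignore' handler.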

I'm not sure about this substitute/replacement duality. Maybe we could have a
single function name, 'default', that would yield U+FFFD for Unicode and 0x1A
for ASCII.
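
One way the single 'default' name could work, sketched in present-day Rust:
the substitute is chosen by the output type rather than by the handler name.
The trait 'Substitute' and its method are hypothetical names introduced only
for this illustration.

```rust
/// Hypothetical trait: each output type knows its own default substitute.
pub trait Substitute {
    fn default_substitute() -> Self;
}

impl Substitute for String {
    /// Unicode output: U+FFFD REPLACEMENT CHARACTER.
    fn default_substitute() -> Self {
        "\u{FFFD}".to_string()
    }
}

impl Substitute for Vec<u8> {
    /// Byte output: the ASCII SUB control byte.
    fn default_substitute() -> Self {
        vec![0x1A]
    }
}

/// One handler name for both directions; the offending input is ignored and
/// the replacement is selected by the requested output type.
pub fn default<I: ?Sized, O: Substitute>(_: &I) -> Option<O> {
    Some(O::default_substitute())
}
```

With this, a decoder handler is default::<[u8], String> and an encoder handler
is default::<str, Vec<u8>>.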
