Re: [rust-dev] Proposed API for character encodings

Simon Sapin Fri, 13 Sep 2013 15:04:51 -0700

Here is an updated proposal, based on email and IRC feedback. The changes are:


* Fix .feed() and .flush() to have the self parameter they need.

* Remove the iterator stuff. I don’t find it super useful, and it’s easy enough to build on top of the "push" API. KISS.

* Duplicate the "one shot" convenience API in Decoder so that it’s usable without involving trait objects and dynamic dispatch.

* Make the output generic in the low-level API by having StringWriter instead of ~str


* Add encoding_from_label()

* De-emphasize the Encoding trait by moving it to the end. It is only useful together with encoding_from_label() and other dynamic-dispatch scenarios. If the encoding to use is known at compile time, one can use e.g. UTF8Decoder directly.

Again, this is only decoders. Encoders are basically the same, with [u8] and str swapped. Maybe the output could just be std::rt::io::Writer.



/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded
/// after DecodeError was returned.
trait Decoder {
    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation left out.
        // This is a default method, but not meant to be overridden.
    }

    fn new() -> Self;

    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed<W: StringWriter>(&self, input: &[u8], output: &mut W)
                          -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush<W: StringWriter>(&self, output: &mut W)
                           -> Option<DecodeError>;
}
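
To make the feed()/flush() contract concrete, here is a minimal sketch of a push-style decoder, written in current Rust syntax rather than the syntax used in this proposal (String for ~str, Vec<u8> for ~[u8], usize for uint). Utf8Decoder and its fields are hypothetical names chosen for illustration. The sketch assumes &mut self, because it keeps state between calls, and it reports errors directly instead of raising the decoding_error condition.

```rust
/// Stand-in for the proposal's DecodeError, in current Rust syntax.
#[derive(Debug, PartialEq)]
pub struct DecodeError {
    pub input_byte_offset: usize,
    pub invalid_byte_sequence: Vec<u8>,
}

pub trait StringWriter {
    fn write_char(&mut self, c: char);
    fn write_str(&mut self, s: &str);
}

impl StringWriter for String {
    fn write_char(&mut self, c: char) { self.push(c); }
    fn write_str(&mut self, s: &str) { self.push_str(s); }
}

/// Hypothetical incremental UTF-8 decoder (illustration only).
pub struct Utf8Decoder {
    /// Bytes of a sequence that was split across two chunks.
    pending: Vec<u8>,
    /// Number of input bytes fully decoded so far.
    consumed: usize,
}

impl Utf8Decoder {
    pub fn new() -> Self {
        Utf8Decoder { pending: Vec::new(), consumed: 0 }
    }

    /// Decode as much of `input` as possible and append it to `output`.
    pub fn feed<W: StringWriter>(&mut self, input: &[u8], output: &mut W)
                                 -> Option<DecodeError> {
        // Prepend whatever was left over from the previous chunk.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(input);
        let err = match std::str::from_utf8(&buf) {
            Ok(s) => {
                output.write_str(s);
                self.consumed += buf.len();
                return None;
            }
            Err(e) => e,
        };
        // Everything before valid_up_to() is guaranteed to be valid UTF-8.
        let valid = err.valid_up_to();
        output.write_str(std::str::from_utf8(&buf[..valid]).unwrap());
        self.consumed += valid;
        match err.error_len() {
            // Incomplete sequence at the end: keep it for the next chunk.
            None => {
                self.pending = buf[valid..].to_vec();
                None
            }
            // Definitely invalid bytes: report them and stop.
            Some(len) => Some(DecodeError {
                input_byte_offset: self.consumed,
                invalid_byte_sequence: buf[valid..valid + len].to_vec(),
            }),
        }
    }

    /// Signal the end of input; a dangling partial sequence is an error.
    pub fn flush<W: StringWriter>(&mut self, _output: &mut W)
                                  -> Option<DecodeError> {
        if self.pending.is_empty() {
            return None;
        }
        Some(DecodeError {
            input_byte_offset: self.consumed,
            invalid_byte_sequence: std::mem::take(&mut self.pending),
        })
    }
}
```

Feeding b"h\xC3" and then b"\xA9llo" writes "h" after the first call and the rest after the second, which is the case the chunked API exists for: a multi-byte sequence split across two reads.
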

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

/// Functions to be used with decoding_error::cond.trap
mod decoding_error_handlers {
    fn fatal(_: ~[u8]) -> Option<~str> { None }
    fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
}
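
The handler contract above (take the invalid bytes, return a replacement or None to abort) can be sketched without the condition! macro by passing the handler as an ordinary function argument. This is an assumption made for illustration, in current Rust syntax; decode_ascii is a hypothetical helper and not part of the proposal.

```rust
#[derive(Debug, PartialEq)]
pub struct DecodeError {
    pub input_byte_offset: usize,
    pub invalid_byte_sequence: Vec<u8>,
}

/// Abort decoding on the first invalid sequence.
pub fn fatal(_: &[u8]) -> Option<String> {
    None
}

/// Substitute U+FFFD REPLACEMENT CHARACTER and keep going.
pub fn replacement(_: &[u8]) -> Option<String> {
    Some("\u{FFFD}".to_string())
}

/// Hypothetical one-shot ASCII decoder that consults `on_error`
/// for every byte outside the ASCII range.
pub fn decode_ascii<H>(input: &[u8], on_error: H) -> Result<String, DecodeError>
where
    H: Fn(&[u8]) -> Option<String>,
{
    let mut out = String::new();
    for (i, &b) in input.iter().enumerate() {
        if b < 0x80 {
            out.push(b as char);
            continue;
        }
        match on_error(&input[i..i + 1]) {
            // The handler supplied a replacement: append it and continue.
            Some(s) => out.push_str(&s),
            // The handler declined: abort with a DecodeError.
            None => {
                return Err(DecodeError {
                    input_byte_offset: i,
                    invalid_byte_sequence: vec![b],
                })
            }
        }
    }
    Ok(out)
}
```
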

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

trait StringWriter {
    fn write_char(&mut self, c: char);
    fn write_str(&mut self, s: &str);
}
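
A short sketch of why a generic StringWriter is more flexible than returning ~str: the same producer can fill a string or feed a sink that never allocates. It is written in current Rust syntax; CharCounter and emit are hypothetical names used only for this example.

```rust
pub trait StringWriter {
    fn write_char(&mut self, c: char);
    fn write_str(&mut self, s: &str);
}

/// Collect the decoded text in memory.
impl StringWriter for String {
    fn write_char(&mut self, c: char) { self.push(c); }
    fn write_str(&mut self, s: &str) { self.push_str(s); }
}

/// Hypothetical sink that only counts code points and stores nothing.
pub struct CharCounter {
    pub count: usize,
}

impl StringWriter for CharCounter {
    fn write_char(&mut self, _c: char) { self.count += 1; }
    fn write_str(&mut self, s: &str) { self.count += s.chars().count(); }
}

/// Stand-in for a decoder: writes some text to any StringWriter.
pub fn emit<W: StringWriter>(out: &mut W) {
    out.write_str("d\u{e9}j");
    out.write_char('\u{e0}');
}
```
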


/// Only supports the set of labels defined in the spec
/// http://encoding.spec.whatwg.org/#encodings
/// Such a label can come e.g. from an HTTP header:
/// Content-Type: text/plain; charset=<label>
fn encoding_from_label(label: &str) -> &'static Encoding {
    // Implementation left out
}
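
A minimal sketch of the label lookup, assuming the matching rules of the WHATWG spec: strip leading and trailing ASCII whitespace, then compare ASCII case-insensitively. It covers only a handful of the spec's labels. To stay self-contained it returns the encoding's name rather than a &'static Encoding, and it returns None for unknown labels; both choices, like the name encoding_name_from_label, are assumptions of this example and not part of the proposal.

```rust
/// Hypothetical lookup covering a small subset of the WHATWG label table.
pub fn encoding_name_from_label(label: &str) -> Option<&'static str> {
    // ASCII whitespace here is tab, LF, FF, CR and space.
    let normalized = label
        .trim_matches(|c: char| c.is_ascii_whitespace())
        .to_ascii_lowercase();
    match normalized.as_str() {
        "utf-8" | "utf8" | "unicode-1-1-utf-8" => Some("UTF-8"),
        // Per the spec, these legacy labels all select windows-1252.
        "windows-1252" | "cp1252" | "latin1" | "iso-8859-1" | "ascii"
        | "us-ascii" => Some("windows-1252"),
        "utf-16le" => Some("UTF-16LE"),
        _ => None,
    }
}
```
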

/// Types implementing this trait are "algorithms"
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are "encodings" as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, "one shot" API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}
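
The dynamic-dispatch path can be sketched as follows, in current Rust syntax (Box<dyn Decoder> for ~Decoder). One assumption differs from the Decoder trait above: in current Rust a trait with generic methods cannot be used as a trait object, so this sketch takes the output as &mut dyn StringWriter instead of a type parameter. Ascii, AsciiDecoder and ASCII are hypothetical names for illustration.

```rust
#[derive(Debug, PartialEq)]
pub struct DecodeError {
    pub input_byte_offset: usize,
    pub invalid_byte_sequence: Vec<u8>,
}

pub trait StringWriter {
    fn write_str(&mut self, s: &str);
}

impl StringWriter for String {
    fn write_str(&mut self, s: &str) { self.push_str(s); }
}

/// Object-safe variant of the push API (assumption: dyn output).
pub trait Decoder {
    fn feed(&mut self, input: &[u8], output: &mut dyn StringWriter)
            -> Option<DecodeError>;
    fn flush(&mut self, output: &mut dyn StringWriter) -> Option<DecodeError>;
}

pub trait Encoding {
    fn new_decoder(&self) -> Box<dyn Decoder>;

    /// One-shot convenience method built on a fresh Decoder.
    fn decode(&self, input: &[u8]) -> Result<String, DecodeError> {
        let mut decoder = self.new_decoder();
        let mut out = String::new();
        if let Some(e) = decoder.feed(input, &mut out) {
            return Err(e);
        }
        if let Some(e) = decoder.flush(&mut out) {
            return Err(e);
        }
        Ok(out)
    }
}

/// Hypothetical strict ASCII decoder; `offset` tracks bytes seen so far.
pub struct AsciiDecoder {
    offset: usize,
}

impl Decoder for AsciiDecoder {
    fn feed(&mut self, input: &[u8], output: &mut dyn StringWriter)
            -> Option<DecodeError> {
        // Index of the first non-ASCII byte, if any.
        let bad = input.iter().position(|&b| b >= 0x80);
        let valid = bad.unwrap_or(input.len());
        // ASCII bytes are always valid UTF-8, so this cannot fail.
        output.write_str(std::str::from_utf8(&input[..valid]).unwrap());
        let start = self.offset;
        self.offset += valid;
        bad.map(|i| DecodeError {
            input_byte_offset: start + i,
            invalid_byte_sequence: vec![input[i]],
        })
    }

    fn flush(&mut self, _output: &mut dyn StringWriter) -> Option<DecodeError> {
        None // ASCII never has pending state.
    }
}

pub struct Ascii;

impl Encoding for Ascii {
    fn new_decoder(&self) -> Box<dyn Decoder> {
        Box::new(AsciiDecoder { offset: 0 })
    }
}

/// A static instance, as encoding_from_label() would hand out.
pub static ASCII: Ascii = Ascii;
```

A caller holding only a &'static dyn Encoding, for example one obtained from a label lookup, can call decode() without knowing the concrete type.
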


What do you think?
--
Simon Sapin
