Re: [rust-dev] Proposed API for character encodings

Olivier Renaud Fri, 20 Sep 2013 11:58:30 -0700

Le vendredi 20 septembre 2013 11:47:04 Simon Sapin a écrit :
> Le 20/09/2013 10:18, Olivier Renaud a écrit :
> > I really like the API you are proposing. In particular, the error handling
> > is close to what I was expecting from such an API.
> > 
> > I have some remarks, though.
> > 
> > Is there a reason for encoders and decoders to not be reusable ? I think
> > it
> > would be reasonable to specify that they get back to their initial state
> > once the 'flush' method is called, or when a 'DecodeError' is returned.
> I don’t have a strong opinion on that. There could be a "reset" or
> similar method, but I don’t see how this is better than just throwing
> the decoder away and making a new one.


I don't see the need for a 'reset' method. A decoder could return to its 
initial state after a call to 'finish'.

> With static dispatch and the encoding known at compile-time, you can
> probably have decoders on the stack so making a new one is cheap.
> 
> If the encoding is determined at run-time and you use trait objects
> (dynamic dispatch) for decoders, the next input might have a different
> encoding so reusing decoders might not be useful either.

My typical usage of a charset decoder is to read many files on disk, all of 
them using the same charset.

> > Is a condition raised when the order of method calls is not respected ?
> > E.g. if one calls 'flush' multiple times, of calls 'feed' and then
> > 'decode' ?
> Decoder::decode is a static method / associated function. It’s
> independent from everything else.

Oh yes of course, my bad.

> Other than that, I don’t know. rust-encoding doesn’t do that. AFAIU it
> leaves this behavior undefined, which I think is fine. Do you think it
> should be explicitly checked for?

Well, in  my opinion it is not a good idea for an API to have undefined 
behavior. Being explicit about what is disallowed also helps the user to 
understand how the API is supposed to be used. Also, I think it's preferable 
to fail fast, when the state of an object becomes invalid.

There are a handful of reasonable behavior, for the decoder :

* If reusing a decoder is legal, then calling 'feed' after 'finish' is legal 
(we start decoding a new stream), no need to introduce a special case. A 
second call to 'finish' can be a noop (we decode an empty stream)

* If reusing a decoder is illegal:

-- Calling 'feed' after 'finish' should be an error. The API must report that 
it is being misused by the programmer. I don't know what is the recommended 
way to do that in Rust. I think it's ok to fail!, or to have an assert. In 
Java, I'd throw an (unchecked) IllegalStateException, which serves exactly 
this purpose.

-- Calling 'finish' a second time can also be a noop, but it would be better to 
be consistent with the 'feed' after 'finish' behavior and to fail.

Another totally different solution would be to use phantom types, to indicate 
the state of the decoder, but that would be overkill. Or typestates :)

Simpler is better, so I think having a reusable decoder with no special 
"invalid" state is the least problematic solution.

> > It is not clear what is given as a parameter to the 'decoding_error'
> > condition. I guess it's the exact subset of byte sequence that cannot be
> > decoded, possibly spanning multiple 'feed' calls. Is that correct ? Is it
> > sufficient for variable-length encodings ?
> 
> Correct, and I think yes. It is called once every time the spec says to
> "run the error algorithm":
> 
> http://encoding.spec.whatwg.org/#error
> 
> > I am doubtful that the encoder is just a decoder with [u8] and str
> > swapped. A decoder must deal with a possibly invalid sequence of bytes,
> > while an encoder deals with str, which is guaranteed to be a valid utf8
> > sequence. An encoder must handle unmappable characters, whereas a decoder
> > doesn't
> 
> You’re right, I cut some corners. In particular, the encoding_error
> condition can take a single (unsupported) 'char'. Other than that, the
> *API* is (very close to?) the same with [u8] and str swapped.
> 
> > (actually, it
> > depends whether we consider unicode to be universal or not...).
> 
> I suggest we consider it is. (For the purpose of the WHATWG spec it is.)
> If Unicode is missing things, the "right" solution is to add things to
> Unicode.

It simplifies many things, indeed.

> > [...]
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev

Re: [rust-dev] Proposed API for character encodings

Reply via email to