Hi Boris,
Thanks for the feedback! Comments inline.
Boris Zbarsky wrote:
...
More precisely, what Gecko does here is to take the raw byte string and
byte-inflate it (by setting the high byte of each 16-bit code unit to 0
and the low byte to the corresponding byte of the given byte string)
before returning it to JS.
This happens to more or less match "decoding as ISO-8859-1", but not quite.
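As a minimal sketch of what "byte-inflation" means here (function name and details are illustrative, not Gecko's actual internals): each raw byte becomes one 16-bit code unit whose high byte is zero.

```javascript
// Byte-inflation sketch: each octet 0x00-0xFF maps unchanged to the
// code unit U+0000-U+00FF (high byte 0, low byte = the octet).
function byteInflate(bytes) {
  let result = "";
  for (const b of bytes) {
    result += String.fromCharCode(b); // b is 0..255, so no surrogates
  }
  return result;
}

// 0xE9 inflates to U+00E9 ("é"):
console.log(byteInflate([0x48, 0x69, 0xE9])); // "Hié"
```

This is lossless on the byte level, which is why it "more or less" coincides with decoding as ISO-8859-1.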
...
Not quite?
...
From HTTP's point of view, the header field value really is opaque, so
you can put anything there, as long as it fits into the header field
ABNF.
True; what does that mean for converting header values to 16-bit code
units in practice? Seems like byte-inflation might be the only
reasonable thing to do...
...
It at least preserves all the information that was there and would allow
a caller to re-decode as UTF-8 as a separate step.
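A sketch of that separate re-decoding step (names are illustrative): since every code unit of a byte-inflated string is at most 0xFF, the original octets can be recovered exactly and then decoded as UTF-8.

```javascript
// Recover the original octets from a byte-inflated string, then
// decode them as UTF-8 in a separate step (illustrative only).
function reDecodeAsUtf8(inflated) {
  // Each char is U+0000-U+00FF, so charCodeAt gives back the raw byte.
  const bytes = Uint8Array.from(inflated, ch => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

// "caf\u00C3\u00A9" is the byte-inflated form of UTF-8 "café"
// (0xC3 0xA9 is the UTF-8 encoding of U+00E9):
console.log(reDecodeAsUtf8("caf\u00C3\u00A9")); // "café"
```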
Of course that only helps if senders and receivers agree on the
encoding.
True, but "encoding" here needs to mean more than just "encoding of
Unicode", since one can stick arbitrary byte sequences, within the ABNF
restrictions, into the header, right?
Yes.
Right now there is no interoperable encoding, so the best thing to do in
APIs that use character sequences instead of octets is to preserve as
much information as possible.
It would be nice if we could find out whether anybody relies on the
current implementation. Maybe switch it back to byte inflation in
Mozilla trunk?
Best regards,
Julian