Sorry if these have all been discussed before. I just read the File
API for the first time and two random questions popped into my head.
1) If I'm using readAsText with a particular encoding, and the data in
the file is not actually in that encoding, such that bytes in the
file cannot be mapped to valid code points, what happens? Is that
implementation specific, or is it specified? I can imagine at least three
different behaviors.
This should be specified better, and currently isn't. I'm inclined to
return the file in the encoding it is actually in rather than force an
encoding (in other words, ignore the encoding parameter if it is
determined that code points can't be mapped to valid code points in that
encoding; also note that we say to "Replace bytes or sequences of bytes
that are not valid according to the charset with a single U+FFFD
character [Unicode <http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]").
Right now, the spec isn't specific about this scenario ("... if the user
agent cannot decode blob using encoding, then let charset be null"
before the algorithmic steps, which essentially forces UTF-8).
Can we list your three behaviors here, just so we get them on record?
Which behavior do you think is ideal? More importantly, is
substituting U+FFFD and "defaulting" to UTF-8 good enough for your
scenario above?
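For the record, the replacement rule the spec references can be illustrated with TextDecoder from the Encoding Standard (a newer API than this thread, but it follows the same "invalid bytes become U+FFFD" behavior that readAsText is expected to exhibit; the byte values here are just an example):

```javascript
// Bytes for "hi", then 0xFF (never valid in UTF-8), then "!".
const bytes = new Uint8Array([0x68, 0x69, 0xff, 0x21]);

// Decoding as UTF-8 does not throw; the invalid byte is replaced
// with a single U+FFFD REPLACEMENT CHARACTER, per the rule quoted above.
const text = new TextDecoder('utf-8').decode(bytes);
console.log(text); // "hi\uFFFD!"
```

Whether a UA should instead fall back to the file's actual encoding, as suggested above, is exactly the open question.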
2) If I'm using readAsText to read a file in a multibyte encoding
(UTF-8, Shift-JIS, etc.), is it implementation dependent whether it can
return partial characters when returning partial results during
reading? In other words, let's say the next character in the file is a
3-byte code point, but the reader has only read 2 of those 3 bytes so
far. Is it implementation dependent whether the result includes those 2
bytes before the 3rd byte is read?
Yes, partial results are currently implementation dependent; the spec
only says they SHOULD be returned. There was reluctance to impose a MUST
condition on partial file reads. I'm open to revisiting this decision
if the justification is a really good one.
-- A*