On 6/30/11 6:01 PM, Gregg Tavares (wrk) wrote:
On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <a...@mozilla.com>
wrote:
Sorry if these have all been discussed before. I just read the
File API for the first time and 2 random questions popped in my
head.
1) If I'm using readAsText with a particular encoding and the
data in the file is not actually in that encoding, such that code
points in the file cannot be mapped to valid code points, what
happens? Is that implementation specific or is it specified? I
can imagine at least 3 different behaviors.
This should be specified better and isn't. I'm inclined to then
return the file in the encoding it is in rather than force an
encoding (in other words, ignore the encoding parameter if it is
determined that code points can't be mapped to valid code points
in the encoding... also note that we say to "Replace bytes or
sequences of bytes that are not valid according to the charset with
a single U+FFFD character [Unicode
<http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]"). Right now,
the spec isn't specific to this scenario ("... if the user agent
cannot decode blob using encoding, then let charset be null"
before the algorithmic steps, which essentially forces UTF-8).
Can we list your three behaviors here, just so we get them on
record? Which behavior do you think is ideal? More importantly,
is substituting U+FFFD and "defaulting" to UTF-8 good enough for
your scenario above?
The 3 off the top of my head were
1) Throw an exception. (content not valid for encoding)
2) Remap bad codes to some other value (sounds like that's the one above)
3) Remove the bad character
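To make the three behaviors concrete, here is a minimal sketch in Python rather than the File API itself (the File API exposes no such switch). It assumes Python's codec error handlers are a fair stand-in: "strict" for (1), "replace" for (2), and "ignore" for (3). The sample bytes and function names are made up for illustration.

```python
# Hypothetical input: "A", then 0xFF (never valid in UTF-8), then "B".
data = b"A\xffB"


def decode_throw(raw: bytes, encoding: str = "utf-8") -> str:
    """Behavior 1: raise an exception when the content is not valid."""
    return raw.decode(encoding, errors="strict")


def decode_replace(raw: bytes, encoding: str = "utf-8") -> str:
    """Behavior 2: remap each bad sequence to U+FFFD."""
    return raw.decode(encoding, errors="replace")


def decode_remove(raw: bytes, encoding: str = "utf-8") -> str:
    """Behavior 3: silently drop the bad bytes."""
    return raw.decode(encoding, errors="ignore")


if __name__ == "__main__":
    print(repr(decode_replace(data)))
    print(repr(decode_remove(data)))
    try:
        decode_throw(data)
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc.reason)
```

With (3) the caller has no trace that anything was dropped, which is one reason (2) looks like the safer default.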
I see you've listed a 4th, "Ignore the encoding on error, assume
utf-8". That one seems problematic because of partial reads. If you
are decoding as shift-jis, have returned a partial read, and then
later hit a bad code point, the stuff you've seen previously will all
need to change by switching to no encoding.
I'd choose #2, which it sounds like is already the case according to the spec.
This is the case in the spec. currently, but:
Regardless of what solution is chosen, is there a way for me to know
something was lost?
I don't think so, actually. And I'm not entirely sure how we can allow
for such a way, unless we throw an error or something.
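One possible shape for such a signal, sketched in Python since the File API defines nothing like it: attempt a strict decode first and fall back to replacement only on failure, returning a flag alongside the text. The function name and the tuple return are assumptions for illustration, not a proposal for the spec's interface.

```python
from typing import Tuple


def decode_with_loss_flag(raw: bytes, encoding: str = "utf-8") -> Tuple[str, bool]:
    """Return (text, lossy). lossy is True if any bytes had to be replaced."""
    try:
        # If strict decoding succeeds, nothing was lost.
        return raw.decode(encoding, errors="strict"), False
    except UnicodeDecodeError:
        # Otherwise fall back to U+FFFD substitution and report the loss.
        return raw.decode(encoding, errors="replace"), True


if __name__ == "__main__":
    print(decode_with_loss_flag(b"A\xffB"))
    print(decode_with_loss_flag("A\ufffdB".encode("utf-8")))
```

Scanning the result for U+FFFD is not a reliable substitute, because a file can legitimately contain an encoded U+FFFD; the second call above returns the same text as the first but with the flag unset.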
2) If I'm using readAsText to read a multibyte encoding (utf-8,
shift-jis, etc.), is it implementation dependent whether or not
it can return partial characters when returning partial results
during reading? In other words, let's say the next character in
a file is a 3 byte code point but the reader has only read 2 of
those 3 bytes so far. Is it implementation dependent whether result
includes those 2 bytes before reading the 3rd byte or not?
Yes, partial results are currently implementation dependent; the
spec. only says they SHOULD be returned. There was reluctance to
have MUST condition on partial file reads. I'm open to revisiting
this decision if the justification is a really good one.
I'm assuming by "MUST condition" you mean a UA doesn't have to support
partial reads at all, not that how partial reads work shouldn't be
specified.
Here's an example.
Assume we stick with mapping unknown characters to U+FFFD.
Assume my stream is utf8 and in hex the bytes are:
E3 83 91 E3 83 91
That's 2 code points of 0x30D1. Now assume the reader has only read
the first 5 bytes.
Should the partial results be
(a) filereader.result.length == 1 where the content is 0x30D1
or should the partial result be
(b) filereader.result.length == 2 where the content is 0x30D1, 0xFFFD
because at that point the E3 83 at the end of the partial result is
not a valid code point?
I think the spec should specify that if the UA supports partial reads
it should follow example (a)
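The difference between (a) and (b) can be reproduced outside the browser. This is a sketch using Python's codecs module as a stand-in for a UA's decoder, on the assumption that an incremental decoder models what a partial read would need: it holds back a trailing incomplete sequence instead of replacing it. The variable names are illustrative only.

```python
import codecs

# The six bytes from the example: U+30D1 encoded twice in UTF-8.
stream = bytes.fromhex("E3 83 91 E3 83 91")
partial = stream[:5]  # the reader has seen only the first 5 bytes

# (b) One-shot decode of the partial buffer: the dangling E3 83 is
# treated as invalid and becomes U+FFFD.
result_b = partial.decode("utf-8", errors="replace")

# (a) Incremental decode: the dangling E3 83 is buffered, not replaced.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
result_a = decoder.decode(partial)

# When the sixth byte arrives, the buffered sequence completes.
rest = decoder.decode(stream[5:], final=True)

if __name__ == "__main__":
    print(len(result_a), [hex(ord(c)) for c in result_a])
    print(len(result_b), [hex(ord(c)) for c in result_b])
    print([hex(ord(c)) for c in result_a + rest])
```

Under (a), earlier partial results remain a prefix of the final result; under (b), a U+FFFD that was visible in a partial result would have to disappear once the remaining byte is read.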
OK. I think the spec. needs more bolstering here. Thanks for your
example. This makes it clearer.
-- A*