On 6/30/11 6:01 PM, Gregg Tavares (wrk) wrote:


On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <a...@mozilla.com> wrote:

    Sorry if these have all been discussed before. I just read the
    File API for the first time and 2 random questions popped in my
    head.

    1) If I'm using readAsText with a particular encoding, and the
    data in the file is not actually in that encoding, such that code
    points in the file cannot be mapped to valid code points, what
    happens? Is that implementation-specific, or is it specified? I
    can imagine at least 3 different behaviors.

    This should be specified better and isn't.  I'm inclined to then
    return the file in the encoding it is in rather than force an
    encoding (in other words, ignore the encoding parameter if it is
    determined that code points can't be mapped to valid code points
    in the encoding... also note that we say to "Replace bytes or
    sequences of bytes that are not valid according to the charset with
    a single U+FFFD character [Unicode
    <http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]").  Right now,
    the spec isn't specific to this scenario ("... if the user agent
    cannot decode blob using encoding, then let charset be null"
    before the algorithmic steps, which essentially forces UTF-8).

    Can we list your three behaviors here, just so we get them on
    record?  Which behavior do you think is ideal?  More importantly,
    is substituting U+FFFD and "defaulting" to UTF-8 good enough for
    your scenario above?


The 3 off the top of my head were:

1) Throw an exception (content not valid for the encoding).
2) Remap bad codes to some other value (sounds like that's the one above; see the sketch below).
3) Remove the bad characters.
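
A rough sketch of behaviors 1 and 2, using the Encoding API's TextDecoder purely for illustration (the File API reader would do the equivalent internally; the bytes here are made up):

    // 0xFF can never appear in well-formed UTF-8.
    const bytes = new Uint8Array([0x61, 0xFF, 0x62]); // "a", <bad byte>, "b"

    // Behavior 1: throw on invalid content.
    try {
      new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    } catch (e) {
      // TypeError: the bytes could not be decoded as UTF-8.
    }

    // Behavior 2: remap bad bytes. The decoder's default (non-fatal)
    // mode substitutes U+FFFD for each invalid sequence.
    const text = new TextDecoder('utf-8').decode(bytes); // "a\uFFFDb"

    // Behavior 3 would drop the bad bytes instead, yielding "ab";
    // no standard decoder mode does that for you.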

I see you've listed a 4th: "Ignore the encoding on error, assume utf-8." That one seems problematic because of partial reads. If you are decoding as shift-jis, have already returned a partial result, and then later hit a bad code point, everything you've returned so far would have to change when you switch encodings.

I'd choose #2, which it sounds like is already the case according to the spec.

This is the case in the spec. currently, but:

Regardless of what solution is chosen, is there a way for me to know something was lost?


I don't think so, actually. And I'm not entirely sure how we could provide such a way, unless we throw an error or something.
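
If the spec doesn't grow an error for this, a page could at least detect loss by re-decoding the raw bytes itself. A hypothetical helper (decodeWithLossFlag is my name for it, not anything in the spec), sketched with TextDecoder:

    // Decode strictly first; if that throws, fall back to lossy
    // decoding and flag that something was replaced with U+FFFD.
    function decodeWithLossFlag(bytes, encoding) {
      try {
        const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
        return { text, lossy: false };
      } catch (e) {
        return { text: new TextDecoder(encoding).decode(bytes), lossy: true };
      }
    }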


    2) If I'm reading a multibyte encoding (utf-8, shift-jis, etc.)
    with readAsText, is it implementation dependent whether or not
    partial characters can be returned when returning partial results
    during reading? In other words, let's say the next character in
    the file is a 3-byte code point, but the reader has only read 2 of
    those 3 bytes so far. Is it implementation dependent whether the
    result includes those 2 bytes before the 3rd byte is read?


    Yes, partial results are currently implementation dependent; the
    spec. only says they SHOULD be returned.  There was reluctance to
    have a MUST condition on partial file reads.  I'm open to revisiting
    this decision if the justification is a really good one.


I'm assuming that by "MUST condition" you mean a UA doesn't have to support partial reads at all, not that the behavior of partial reads should be left unspecified.

Here's an example.

Assume we stick with unknown characters getting mapped to U+FFFD.
Assume my stream is utf-8 and, in hex, the bytes are:

E3 83 91 E3 83 91

That's the code point 0x30D1 twice. Now assume the reader has only read the first 5 bytes.

Should the partial results be

(a) filereader.result.length == 1, where the content is 0x30D1

or should the partial result be

(b) filereader.result.length == 2, where the content is 0x30D1, 0xFFFD, because at that point the E3 83 at the end of the partial result is not yet a complete code point

I think the spec should specify that if the UA supports partial reads, it should follow example (a).
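
For what it's worth, a streaming decoder naturally gives you (a): it buffers the trailing incomplete sequence until more bytes arrive. A sketch using TextDecoder's streaming mode, purely for illustration:

    const bytes = new Uint8Array([0xE3, 0x83, 0x91, 0xE3, 0x83, 0x91]);
    const decoder = new TextDecoder('utf-8');

    // First 5 bytes: the trailing E3 83 is an incomplete sequence,
    // so { stream: true } holds it back instead of emitting U+FFFD.
    const partial = decoder.decode(bytes.subarray(0, 5), { stream: true });
    // partial.length == 1 and partial is "\u30D1"  -- behavior (a)

    // The final byte completes the second code point.
    const rest = decoder.decode(bytes.subarray(5)); // "\u30D1"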

OK. I think the spec. needs more bolstering here. Thanks for your example. This makes it clearer.

-- A*
