Re: [whatwg] Accessing local files with JavaScript portably and securely
On Wed, Apr 19, 2017 at 8:23 AM, Roger Hågensen wrote:

> On 2017-04-19 11:28, Anne van Kesteren wrote:
>> I already pointed to https://wicg.github.io/entries-api/ as a way to
>> get access to a directory of files and as a way to get access to a
>> sequence of files. Both for read access. I haven't seen any interest
>> to go beyond that.
>
> Is this the Filesystem & FileWriter API?

A small subset of the functionality specified in FileSystem was used by Chrome to expose directory upload. Support for the subset necessary for interop of directory upload has been implemented by Firefox and Edge. I put up the entries-api spec to try to re-specify just that subset. (It's a work in progress.)

> This was added to Chrome/Opera under the webkit prefix 7 years ago;
> Edge and Firefox have not picked this up yet (just the Reader part),
> as shown by http://caniuse.com/#search=file

The market apparently demonstrates that a sandboxed file system storage API isn't a high priority for browser vendors to implement.

> I avoid prefixed features, and try to use only features that the latest
> Edge/Chrome/Firefox support, so that end users are less likely to end up
> in a situation where their browser does not support an app.
>
> And unless I remember wrong, Firefox did support this at some point,
> then removed it again.
>
> Take for example my soundbank app. An end user would want to use either
> a file selector or drag'n'drop to the app (browser) window to add files
> to the soundboard. Let us assume that 30+ sounds are added (I don't even
> think the file requester handles multi-selection properly in all
> browsers today). Would it be fair to expect the user to have to re-add
> these each time they start/open the app? During a week that is a lot of
> pointless work. Saving filenames is not practical, and even if it were,
> there would be no paths.
> And storing the sounds in IndexedDB or localStorage is out of the
> question, as that is limited to a total of 5 MB or even less in most
> browsers; 30+ samples easily consume that.

You may want to check again. An origin typically gets an order of magnitude more storage than that for Indexed DB across browsers and devices.

> The ideal here is to make an HTML soundboard app locally (i.e. file://),
> then copy it as-is to a webserver. Users can either use it from there
> (http:// or https://, online and/or offline) or "Save As" the document
> and use it locally (file://) for preservation or offline use without a
> server dependency.
>
> The only way to make this work currently is to make the user hand-write
> the path (full or relative) to each sound and store that in localStorage
> along with volume and fade in/out. But fade in and out is "faked" by
> adjusting the volume, as CORS prevents processing the audio and doing a
> proper crossfade between sounds; that is possible but locked down due
> to CORS.
>
> I can understand limitations due to security concerns, but arbitrary
> limitations to functionality baffle me.
>
> I do not see much difference between file:// and http(s):// besides one
> allowing serverside data processing and HTTP headers, but these days
> most apps are entirely clientside. A sample editor can be written that
> is fully clientside, even including mic recording, normalizing, and FX;
> the server is not involved in any stage except delivering the .html file
> plus a few lines of headers. The web app itself is identical (i.e.
> hash/checksum identical) be it http(s): or file:
>
> The benefit is that "the app is the source code", which is an ideal goal
> of open source, as anyone can review, copy, and modify as they please.
> And in theory it could run just as well truly offline/standalone as it
> could online, without the need for a local webserver or similar.
> I'd dare say that thinking of a web app as something hosted only from a
> server via http(s) is an antiquated idea. These days a "web" app can be
> hosted via anything; want to open a webapp that is served from cloud
> storage like Dropbox? Not a problem. Well, almost not a problem: the
> cloud storage probably does not have the proper CORS headers to allow a
> sample editor to process sound from local files or files stored on a
> different cloud service.
>
> And a soundboard or a sample editor are just two examples; an image or
> video editor would have similar issues. Or what about a game with mod
> support? Being able to drag'n'drop a mod onto a game and then have the
> game load it the next time you start the game would be a huge benefit.
> But currently this cannot be done; the mod would have to be uploaded to
> the server the game is served from, even if the game itself does not use
> or need any serverside scripting.
>
> Or imagine a medical app that needs to read in CSV data; such an app
> could work fully offline/local and load up the data each time it's
> started. Storing the data in localStorage/IndexedDB would be limited by
> whatever else is stored as far as size
Re: [whatwg] Persistent and temporary storage
On Mon, Mar 16, 2015 at 1:38 AM, Anne van Kesteren ann...@annevk.nl wrote:

On Fri, Mar 13, 2015 at 5:06 PM, Joshua Bell jsb...@chromium.org wrote:

A handful of us working on Chrome have been having similar discussions around what we've been calling "durable" storage. In its simplest model it is a bit granted by the user to an origin, which then requires explicit user action before the data might be cleared under storage pressure. So it sounds like our thinking is broadly aligned, although we're still exploring various possibilities and their implications for permission prompts, cleanup UI, behavior under pressure, etc.

Yeah, same here; the wiki page outlines a tentative plan.

Gotcha. And thanks again for opening up this discussion!

Similarly, we've been trying to keep this orthogonal from quota (either the UA's logic for assigning a quota to an origin, or possible standardized quota APIs), although the UA may use similar signals for granting permissions/assigning quota.

I think we've come around to the view that we need to expose quota in some way, to give developers some expectation of how much they can fetch and then store in "best effort" mode.

I think that matches our latest discussions too...

But that for persistent it can be the whole disk.

... and we're waffling on that one. Going that far implies that the UA does a really good job, on its own or with user interaction, of responding when the storage is indeed getting full. Mobile OSes typically provide UI to inspect how much storage is in use and to clear apps and/or portions of their storage. IMHO, we need to fully develop that UX in the UA before I'd be comfortable letting sites easily consume the whole disk. But we realize that artificially capping disk usage is a gap between web and native, and so solving that problem is a high priority for us.
And I don't think there are spec/standards implications here, so we can move fast on the UA side, as long as we spec that QuotaExceededError can happen on various operations regardless of permissions, because even "unlimited" quota can be constrained by physical limits.

(FYI, we've been using "durable" and "non-durable" to distance the discussion from the now-loaded "temporary" vs. "persistent" terms which surfaced in earlier API proposals, some of which are implemented in Chrome.)

Ah right. The current set of terms I have is best effort (default; fixed quota), persistent (requires some kind of user opt-in, probably through an API-triggered dialog, but maybe also granted if you pin a tab or bookmark or some such; 'unlimited' quota), and temporary (exists outside of best effort/persistent, e.g. for storing social network resources and other volatile assets; requires some kind of API opt-in; fixed quota).

If I'm reading the wiki page correctly, I'm intrigued by the "temporary" proposal. To confirm: you're envisioning a completely new lightweight storage API, and there's no implied addition to the other storage APIs? If so... well, pros and cons. I'm not a huge fan of adding Yet Another Storage API. On the other hand, I'd rather do that than fork the existing storage APIs into temp/persistent and try to shoehorn priorities into those. If it helps, I did a thought experiment a while ago on "what would a stripped-down, Promise-based IDB-lite look like?" at https://gist.github.com/inexorabletash/c8069c042b734519680c - it doesn't have the priority scheme, but that would be easy to add at the 'open' entry point.

... One thing we should discuss under the storage umbrella is how atomically we treat all storage for an origin. Customers we've talked to acknowledge the reality that even durable storage can be wiped in the face of user action (e.g. via settings UI to clear cookies etc.) or file corruption. One of the situations they're concerned about is dealing with partial clearing of data, e.g.
Indexed DB databases are present but the SW cache has been wiped, or vice versa. Currently, for quota-based storage eviction, we evict an origin's entire storage at once; that's easiest for sites to reason about, since it matches the "first time user" or "returning user on new device" scenarios that must already be supported. If we're taking a step back to think of storage as a whole, we may want to provide more spec-level assurance in this area.

-- https://annevankesteren.nl/
Re: [whatwg] Persistent and temporary storage
Very timely! A handful of us working on Chrome have been having similar discussions around what we've been calling "durable" storage. In its simplest model it is a bit granted by the user to an origin, which then requires explicit user action before the data might be cleared under storage pressure. So it sounds like our thinking is broadly aligned, although we're still exploring various possibilities and their implications for permission prompts, cleanup UI, behavior under pressure, etc.

Similarly, we've been trying to keep this orthogonal from quota (either the UA's logic for assigning a quota to an origin, or possible standardized quota APIs), although the UA may use similar signals for granting permissions/assigning quota.

(FYI, we've been using "durable" and "non-durable" to distance the discussion from the now-loaded "temporary" vs. "persistent" terms which surfaced in earlier API proposals, some of which are implemented in Chrome.)

On Fri, Mar 13, 2015 at 7:25 AM, Janusz Majnert j.majn...@samsung.com wrote:

On 13.03.2015 15:01, Anne van Kesteren wrote:

On Fri, Mar 13, 2015 at 2:58 PM, Janusz Majnert j.majn...@samsung.com wrote:

The real question is why having a quota is useful?

The reason developers want it is to know how much they can download and store without getting an exception.

Which still doesn't guarantee they won't get an exception if the device runs out of space for whatever reason.

Native apps are not controlled when it comes to storing data, and nobody complains.

Is there any documentation on how they handle the above scenario? Just write to disk until you hit failure?

I think so. This is certainly the case with desktop apps.
I also didn't find any mention of quota in the Android download manager docs (http://developer.android.com/reference/android/app/DownloadManager.html) or in Tizen's Download API (https://developer.tizen.org/dev-guide/2.3.0/org.tizen.mobile.native.apireference/group__CAPI__WEB__DOWNLOAD__MODULE.html).

Regards,
--
Janusz Majnert
Senior Software Engineer
Samsung RD Institute Poland
Samsung Electronics
[whatwg] AppCache error event details
We'd like to move forward on adding event details to AppCache errors. [1] I've fleshed out a proposal [2] that details the additions to the HTML spec. It introduces a new event type (ApplicationCacheErrorEvent) with reason (an enum), url, status, and message fields.

Feedback would be appreciated, particularly about what level of information is safe to expose about cross-origin resource fetches, which Chrome does support with AppCache (as Hixie mentioned in [3]). The fact that a fetch failure occurred with a specific URL does not appear to be a new piece of information, but presumably further details (e.g. the specific HTTP response code, a more detailed message) should not be exposed?

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=22702
[2] https://docs.google.com/document/d/1nlk7WgRD3d0ZcfK1xrwBFVZ3DI_e44j7QoMd5gAJC4E/edit?usp=sharing
[3] https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/blSfs6IqcvY/jCfGdH3p8eAJ
Re: [whatwg] BinaryEncoding for Typed Arrays using window.btoa and window.atob
On Mon, Aug 12, 2013 at 4:50 PM, Glenn Maynard gl...@zewt.org wrote:

On Mon, Aug 12, 2013 at 12:16 PM, Joshua Bell jsb...@google.com wrote:

To recap history: early iterations of the Encoding API proposal did have base64, but it was removed with the suggestion to extend atob()/btoa() instead, and due to the confusion around the encode/decode verbs. If the APIs were something like StringToBytesConverter::convert() and BytesToStringConverter::convert() it would make more sense for encoding of both text (use StringToBytes) and binary data (use BytesToString).

I thought about suggesting something like StringToBytes, but that seems less obvious for the (probably) more common usage of encoding/decoding a String, and it's still a bit off (though not *strictly* wrong) for converting to UTF-16, UTF-32, etc. I tend to think the slightly unintuitive names of TextEncoder and TextDecoder aren't bad enough that it's worth renaming them.

For completeness, it's also worth bringing up https://developer.mozilla.org/en-US/docs/Code_snippets/StringView which started this round of discussion (over on blink-dev), and which is another, more neutral API design for binary/string data interop. I haven't read it deeply, but it looks like it doesn't handle the streaming case, but does explicitly tackle base64 without overloading text encoding methods.

While we're re-opening this can of worms, there's been a request to add a flush() method to the TextEncoder/TextDecoder objects, which would behave the same as calling encode(null, {stream: false}) / decode(null, {stream: false}) but make the code more readable. This fails the "adding a new method for something that behaves exactly like something we already have" test. Opinions?

I think you only need to say encode() and decode(), which is less of a win, especially since creating two ways of doing the same thing means that people have to learn both ways.
Otherwise, they'll see code end with .encode() and not realize that it's the same as the .finish() they've been using.

True. (I need to go back through this and other feedback that's trickled in, see if I'm misrepresenting it, and see if there's anything else lingering.)

On Mon, Aug 12, 2013 at 6:26 PM, Jonas Sicking jo...@sicking.cc wrote:

I don't think that base64 encoding fits with the current TextEncoder/Decoder API. Not because of names, but because base64 encoding is by nature opposite: the encoded format is in string form, whereas the decoded format is in binary form.

The names are the only things that are opposite. TextEncoder is just a streaming String-to-binary-blob conversion API, and TextDecoder is just a streaming binary-blob-to-String API, and that's precisely what base64 encoding and decoding are. That's the same whether you're converting String-to-base64 or String-to-UTF-8. The only difference is that the names we've given to those ideas are reversed here.

Yes. One thing that might need special attention is that U+FFFD error handling doesn't make sense for base64; errors should probably always be fatal.

Excellent point. ... I believe we may experiment with api-base64 and see if there are other gotchas beyond this and the naming. -- Glenn Maynard
Re: [whatwg] BinaryEncoding for Typed Arrays using window.btoa and window.atob
Back from a vacation, sorry about the late reply - hopefully still useful.

On Wed, Aug 7, 2013 at 3:02 PM, Glenn Maynard gl...@zewt.org wrote:

On Wed, Aug 7, 2013 at 4:21 PM, Chang Shu csh...@gmail.com wrote:

If we plan to enhance the Encoding spec, I personally prefer a new pair of BinaryDecoder/BinaryEncoder, which will be less confusing than reusing TextDecoder/TextEncoder.

I disagree with the idea of adding a new method for something that behaves exactly like something we already have, just to give it a different name. (It may not be too late to rename those functions, if nobody has implemented them yet, but I'm not convinced it's much of a problem.)

FWIW, I've landed an experimental (behind a flag) implementation of the API in Blink/Chromium; changing it is definitely possible for us. I believe Moz is shipping it web-exposed already in FF?

To recap history: early iterations of the Encoding API proposal did have base64, but it was removed with the suggestion to extend atob()/btoa() instead, and due to the confusion around the encode/decode verbs. If the APIs were something like StringToBytesConverter::convert() and BytesToStringConverter::convert() it would make more sense for encoding of both text (use StringToBytes) and binary data (use BytesToString).

While we're re-opening this can of worms, there's been a request to add a flush() method to the TextEncoder/TextDecoder objects, which would behave the same as calling encode(null, {stream: false}) / decode(null, {stream: false}) but make the code more readable. This fails the "adding a new method for something that behaves exactly like something we already have" test. Opinions?
Re: [whatwg] Adding a btoa overload that takes Uint8Array
On Mon, Mar 4, 2013 at 9:09 AM, Boris Zbarsky bzbar...@mit.edu wrote:

The problem I'm trying to solve is sending Unicode text to consumers who need base64-encoded input. Right now the only "sane" way to do it (and I quote "sane" for obvious reasons) is something like the example at https://developer.mozilla.org/en-US/docs/DOM/window.btoa#Unicode_Strings

It seems like it would be better if the output of a TextEncoder could be passed directly to btoa. But for that we need an overload of btoa that takes a Uint8Array.

FYI, I believe the last iteration on this topic ended with this message: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-June/036372.html i.e. consensus that base64 should stay out of the Encoding API, but that it would be nice to have some form of base64 / Typed Array conversion API. But there were no concrete proposals beyond my strawman in that post.

So: agreed, have at it!
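The workaround being discussed (routing TextEncoder output through btoa) can be sketched in a few lines. The helper names here (base64FromString, stringFromBase64) are illustrative, not a proposed API; this assumes an environment where TextEncoder/TextDecoder and atob/btoa are all available:

```javascript
// Encode a JS string to UTF-8 bytes with TextEncoder, then base64 the
// bytes via btoa. btoa only accepts "binary strings" (code points 0-255),
// so each byte is smuggled through as one char -- the awkwardness that an
// Uint8Array-accepting btoa overload would remove.
function base64FromString(s) {
  const bytes = new TextEncoder().encode(s); // string -> UTF-8 bytes
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

function stringFromBase64(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return new TextDecoder().decode(bytes); // UTF-8 bytes -> string
}
```

With an overload taking a Uint8Array, the first helper would collapse to `btoa(new TextEncoder().encode(s))`, which is the point of this thread.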
Re: [whatwg] Encoding: API
On Thu, Oct 18, 2012 at 1:49 AM, Anne van Kesteren ann...@annevk.nl wrote:

I added the API to the Encoding Standard: http://encoding.spec.whatwg.org/#api Feedback welcome. I suppose we might want to write an introduction for it too.

Thanks, Anne! Excellent cleanup, too.

On Thu, Oct 11, 2012 at 6:37 PM, Joshua Bell jsb...@chromium.org wrote:

It sounds like there are several desirable behaviors:
1. ignore BOM handling entirely (BOM would be present in output, or fatal)
2. if matching BOM, consume; otherwise, ignore (mismatching BOM would be present in output, or fatal)
3. switch encoding based on BOM (any of UTF-8, UTF-16LE, UTF-16BE)
4. switch encoding based on BOM if-and-only-if UTF-16 explicitly specified, and only to one of the UTF-16 variants

I went with supporting just 2 for now. 4 seems weird. As per IRC discussion, if someone wants to implement this functionality it is fairly simple from script.

On Thu, Oct 18, 2012 at 11:24 PM, Anne van Kesteren ann...@annevk.nl wrote:

On Thu, Oct 18, 2012 at 4:16 PM, Glenn Maynard gl...@zewt.org wrote:

On Thu, Oct 18, 2012 at 3:54 AM, Anne van Kesteren ann...@annevk.nl wrote:

* TextDecoder.decode()'s view argument is no longer optional. Why should it be?

It buffers the EOF byte when in streaming mode, e.g. when the last byte of the stream is a UTF-8 continuation byte, so any encode errors are triggered.

* TextEncoder.encode()'s input argument is no longer nullable. Again, why should it be?

Likewise for encoding, to flush errors for trailing high surrogates.

I made these arguments optional now (and named them both input). Note however that the way you get the EOF byte/EOF code point is by omitting the dictionary (whose stream member defaults to false), but I can see how not passing any arguments as a final call is convenient. https://github.com/whatwg/encoding/commit/39a201a5cdf43be3d49c6bac7952a0ecb225886b

Yes, purely convenience.
Otherwise you'd need to call:

decoder.decode(buffer1, {stream: true});
decoder.decode(buffer2, {stream: true});
decoder.decode(new Uint8Array());

I also raised the issue of whether TextEncoder should really support utf-16/utf-16be, as the Encoding Standard tries to deprecate non-utf-8 encodings.

The whole point of this API is to support legacy file formats that use other encodings. (It's probably questionable to not support other encodings, too, e.g. filenames in ZIP file headers, but starting out with Unicode is fine.)

I thought it was mostly about reading legacy formats, but fair enough. Jonas did a straw poll via Twitter about whether encoding to UTF-16 was needed, and received positive feedback.
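The convenience of a final argument-free call that flushes the stream is easiest to see with a multi-byte sequence split across chunks. A sketch, assuming a TextDecoder that supports the {stream: true} option as discussed above:

```javascript
// "€" (U+20AC) is three UTF-8 bytes: 0xE2 0x82 0xAC. Split them across
// two chunks to show why a final flushing call matters: the decoder must
// buffer an incomplete sequence rather than emit it.
const decoder = new TextDecoder("utf-8");

const part1 = decoder.decode(Uint8Array.of(0xE2), { stream: true });
// part1 is "" -- the lead byte is buffered, awaiting continuation bytes.

const part2 = decoder.decode(Uint8Array.of(0x82, 0xAC), { stream: true });
// part2 is "€" -- the sequence completes.

const tail = decoder.decode();
// Final call with no argument: flushes the stream, surfacing an error (or
// replacement character) if any incomplete bytes were still buffered.
```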
Re: [whatwg] Encoding: API
On Wed, Oct 10, 2012 at 6:42 AM, Anne van Kesteren ann...@annevk.nl wrote:

Hey, I was wondering whether it would make sense to define http://wiki.whatwg.org/wiki/StringEncoding as part of http://encoding.spec.whatwg.org/ Tying them together makes sense to me anyway, and is similar to what we do with URL, HTML, etc.

No objection from me.

As for the open issue, I think it would make sense if the encoding's name was returned. Label is just some case-insensitive keyword to get there.

I tend to agree, as the label gives you no information you don't already have, and the name can at least be a diagnostic.

I also still think it's kinda yucky that this API has this gigantic hack around what the rest of the platform does with respect to the byte order mark. It seems really weird to not expose the same encode/decode that HTML/XML/CSS/etc. use.

IMHO the API needs to support two use cases: (1) code that wants to follow the behavior of the web platform with respect to legacy content (i.e. the desire to self-host), and (2) code that wants to parse files that are not traditionally web data, i.e. fragments of binary files, which don't have legacy behavior and where the BOM taking priority would be surprising to developers. For #2, following the behavior of APIs like ICU with respect to BOMs is more sensible. I believe #2 is the higher priority as long as it does not preclude #1, and #1 can be achieved by code that inspects the stream before handing it off to the decoder. Practically speaking, this would mean refactoring the combined spec so that the current BOM handling is defined for parsing web content outside of the API, rather than requiring the API to hack around it.

... While we're here, any feedback from implementers? Mozilla is apparently quite far along. Any surprises or additional issues? Any initial feedback from users?
I received feedback recently that the API is perhaps too terse right now when dealing with streaming content, and a more explicit decode(), decodeStream(), resetStream() might be more intelligible. Thoughts?
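For use case #1 above, the "code that inspects the stream before handing it off to the decoder" approach might look like the following sketch. sniffEncoding is a hypothetical helper, not part of any proposal here:

```javascript
// Choose an encoding label from a leading BOM, falling back to a
// caller-supplied default. This gives BOM-priority behavior (web-platform
// style) entirely in user code, leaving the decoder API itself BOM-neutral.
function sniffEncoding(bytes, fallback = "utf-8") {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "utf-16le";
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "utf-16be";
  }
  return fallback; // no BOM: use whatever the caller declared
}
```

A caller would then construct `new TextDecoder(sniffEncoding(bytes))` and decode as usual, which is the refactoring direction described above.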
Re: [whatwg] StringEncoding open issues
On Fri, Aug 17, 2012 at 5:19 PM, Jonas Sicking jo...@sicking.cc wrote:

On Fri, Aug 17, 2012 at 7:15 AM, Glenn Maynard gl...@zewt.org wrote:

On Fri, Aug 17, 2012 at 2:23 AM, Jonas Sicking jo...@sicking.cc wrote:

- If encoding is utf-16 and the first bytes match 0xFF 0xFE or 0xFE 0xFF, then set current encoding to utf-16 or utf-16be respectively and advance the stream past the BOM. The current encoding is used until the stream is reset.
- Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF, then set current encoding to utf-16, utf-16be or utf-8 respectively and advance the stream past the BOM. The current encoding is used until the stream is reset.

This doesn't sound right. The effect of the rules so far would be that if you create a decoder and specify utf-16 as encoding, and the first bytes in the stream are 0xEF 0xBB 0xBF, you'd silently switch to utf-8 decoding.

I think the scope of the "otherwise" is unclear, and this is meant to be "otherwise (if encoding is not utf-16)".

Ah, that would make sense. It effectively means "if encoding is not set". / Jonas

I've attempted to distill the above into the spec in an algorithmic way: http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder

English version: If you specify utf-16 you get endian-agnostic UTF-16 encoding support. Failing that, if your encoding matches your BOM it is consumed. Failing *that*, you get whatever behavior falls out of the decode algorithm (garbage, error, etc).

The JS shim has *not* been updated yet. Only part of this edit has been live for the last few weeks - apologies to the Moz folks who were trying to understand what the half-specified internal useBOM flag was for. Any implementer feedback so far?
Re: [whatwg] StringEncoding open issues
On Mon, Sep 17, 2012 at 2:17 PM, Anne van Kesteren ann...@annevk.nl wrote:

On Mon, Sep 17, 2012 at 11:13 PM, Joshua Bell jsb...@chromium.org wrote:

I've attempted to distill the above into the spec in an algorithmic way: http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder

English version: If you specify utf-16 you get endian-agnostic UTF-16 encoding support. Failing that, if your encoding matches your BOM it is consumed. Failing *that*, you get whatever behavior falls out of the decode algorithm (garbage, error, etc).

Why would we want the API to work differently from how it works in markup (with meta charset etc.)? Granted it's not super logical, but I don't really see why we should make it inconsistent and more complicated.

That's how the spec started out, so a recap of this thread would give you the back-and-forth that led here. To summarize: Having the BOM in the content be higher priority than the encoding selected by the developer was not seen as desirable (see earlier in the thread), and was potentially a source of errors. Selecting the encoding via BOM (in general, or to emulate meta charset, etc.) was seen as something that could be done in user code if desired, but unexpected otherwise.

Two desired behaviors remained: (1) developer need for BOM-specified, endian-agnostic UTF-16 decoding similar to ICU's handling, which distinguishes utf-16 from utf-16le, and (2) that matching BOMs should be consumed and not appear in the decoded data.
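Desired behavior (1), endian-agnostic "utf-16" selection via BOM, can also be approximated in user code. This sketch assumes only a utf-16le-capable decoder is available and byte-swaps big-endian input; decodeUTF16 is a hypothetical helper illustrating the ICU-style convention, not the spec's algorithm:

```javascript
// ICU-style "utf-16": pick endianness from the BOM, consume it, and
// default to little-endian when no BOM is present.
function decodeUTF16(bytes) {
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    // Big-endian BOM: swap the bytes of each code unit, then decode as LE.
    const swapped = new Uint8Array(bytes.length - 2);
    for (let i = 2; i + 1 < bytes.length; i += 2) {
      swapped[i - 2] = bytes[i + 1];
      swapped[i - 1] = bytes[i];
    }
    return new TextDecoder("utf-16le").decode(swapped);
  }
  // Little-endian BOM (if any) is consumed; otherwise decode from offset 0.
  const start =
    bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE ? 2 : 0;
  return new TextDecoder("utf-16le").decode(bytes.subarray(start));
}
```

This also satisfies desired behavior (2): a matching BOM never appears in the decoded output.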
Re: [whatwg] StringEncoding open issues
On Wed, Aug 15, 2012 at 5:30 PM, Glenn Maynard gl...@zewt.org wrote:

On Tue, Aug 14, 2012 at 12:34 PM, Joshua Bell jsb...@chromium.org wrote:

- Create a decoder with TextDecoder() and, if present, a BOM will be respected (and consumed); otherwise default to UTF-8.

Let's not default to autodetecting Unicode formats. It encourages people to support UTF-16 when they may not mean to. If BOM detection for both UTF-8 and UTF-16 is wanted, I'd suggest something explicit, like "utf-*". If the argument to the ctor is optional, I think the default should be purely UTF-8.

Works for me. In the algorithm specified in the email, this simply removes the clause "If encoding is not specified, set an internal useBOM flag"; namely, only utf-16 gets the useBOM flag. I'll attempt to wedge this into the spec soon.

This gets easier if we restrict to encoding UTF-8, which typically doesn't include BOMs. But it's looking like there's enough desire to keep UTF-16 encoding at the moment.

Agree with just stripping it for now.

UTF-8 sometimes does have a BOM, especially on Windows, where applications sometimes use it to distinguish UTF-8 from ACP text files (which are just as common as ever; Windows has made no motion away from legacy encodings whatsoever).

Good point. Ah, Notepad, my old friend...

Stripping the BOM can cause those applications to misinterpret the files as ACP. Anyway, even if the encoding API gives a helper for this, figuring out how that works would probably be more effort for developers than just peeking at the ArrayBuffer for the BOM and adding it back in manually. (I'm pretty sure anybody who knows enough to pay attention to this in the first place will have no trouble doing that.) So, yeah, let's not worry about this. -- Glenn Maynard
Re: [whatwg] StringEncoding open issues
On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard gl...@zewt.org wrote:

I agree with Jonas that encoding should just use a replacement character (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off other modes (e.g. exceptions and user-specified replacement characters) until there's a clear need. My intuition is that encoding DOMString to UTF-16 should never have errors; if there are dangling surrogates, pass them through unchanged. There's no point in using a placeholder that says "an error occurred here" when the error can be passed through in exactly the same form (not possible with e.g. DOMString to SJIS). I don't feel strongly about this, only because outputting UTF-16 is so rare to begin with.

On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell jsb...@chromium.org wrote:

- if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes the byte order mark (the encoding-specific serialization of U+FEFF).

This rarely detects the wrong type, but that doesn't mean it's not the wrong answer. If my input is meant to be UTF-8, and someone hands me BOM-marked UTF-16, I want it to fail in the same way it would if someone passed in SJIS. I don't want it silently translated. On the other hand, it probably does make sense for UTF-16 to switch to UTF-16BE, since that's by definition the original purpose of the BOM. The convention iconv uses, which I think is a useful one, is: decoding from "UTF-16" means "try to figure out the encoding from the BOM, if any", while "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".

Let me take a crack at making this into an algorithm.

In the TextDecoder constructor:
- If encoding is not specified, set an internal useBOM flag.
- If encoding is specified and is a case-insensitive match for utf-16, set an internal useBOM flag. NOTE: This means if utf-8, utf-16le or utf-16be is explicitly specified, the flag is not set.
When decode() is called:
- If useBOM is set and the stream offset is 0, then:
  - If there are not enough bytes to test for a BOM, then return without emitting anything. (NOTE: if not streaming, an EOF byte would be present in the stream, which would be a negative match for a BOM.)
  - If encoding is utf-16 and the first bytes match 0xFF 0xFE or 0xFE 0xFF, then set current encoding to utf-16 or utf-16be respectively and advance the stream past the BOM. The current encoding is used until the stream is reset.
  - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF, then set current encoding to utf-16, utf-16be or utf-8 respectively and advance the stream past the BOM. The current encoding is used until the stream is reset.
- Otherwise, if useBOM is not set and the stream offset is 0, then if the encoding is utf-8, utf-16 or utf-16be:
  - If there are not enough bytes to test for a BOM, then return without emitting anything. (NOTE: if not streaming, an EOF byte would be inserted, which would be a negative match for a BOM.)
  - If the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF, then let detected encoding be utf-16, utf-16be or utf-8 respectively. If the detected encoding matches the object's encoding, advance the stream past the BOM. Otherwise, if the fatal flag is set, then throw an EncodingError DOMException. Otherwise, the decoding algorithm proceeds.

Working the current encoding switcheroo into the spec will require some refactoring, so I'm trying to get consensus here first.
In English:
- Create a decoder with TextDecoder() and, if present, a BOM will be respected (and consumed); otherwise default to UTF-8.
- Create a decoder with TextDecoder("utf-16") and either a UTF-16LE or UTF-16BE BOM will be respected (and consumed); otherwise default to UTF-16LE (which may decode garbage if a UTF-8 BOM or other non-UTF-16 data is present).
- Create a decoder with TextDecoder("utf-8", {fatal: true}), TextDecoder("utf-16le", {fatal: true}), or TextDecoder("utf-16be", {fatal: true}) and a matching BOM will be consumed; a mismatching BOM will throw an EncodingError.
- Create a decoder with TextDecoder("utf-8"), TextDecoder("utf-16le"), or TextDecoder("utf-16be") and a matching BOM will be consumed; a mismatching BOM will be blithely decoded (probably giving you replacement characters), but will not throw.

* If one of the UTF encodings is specified AND the BOM matches, then the leading BOM character (U+FEFF) MUST NOT be emitted in the output character sequence (i.e. it is silently consumed).

It's a little weird that

data = readFile("user-supplied-file.txt"); // shortcutting for brevity
var s = new TextDecoder("utf-16").decode(data); // or "utf-8"
s = s.replace("a", "b");
var data2 = new TextEncoder("utf-16").encode(s);
writeFile("user-supplied-file.txt", data2);

causes the BOM to be quietly stripped away. Normally if you're modifying a file, you want to pass through the BOM (or lack
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
Sorry if this is a dupe; I replied to this from my phone and an incorrect address, and my earlier reply isn't showing in the archives. On Fri, Aug 10, 2012 at 9:16 PM, Jonas Sicking jo...@sicking.cc wrote: The spec now contains the following text: NOTE: Because only UTF encodings are supported, and because of the algorithm used to convert a DOMString to a sequence of Unicode characters, no input can cause the encoding process to emit an encoder error. This is not correct. A DOMString is not a sequence of Unicode characters, it's a UTF16 encoded string (this is per EcmaScript). Thus it can contain unpaired surrogates and so the encoding process can result in encoder errors. As I've suggested earlier, I think we should deal with this by simply emitting Unicode replacement characters for these encoder errors (i.e. for unpaired surrogates). Already accounted for. Note the phrase: and because of the algorithm used to convert a DOMString to a sequence of Unicode characters This refers to the normative text that generates a sequence of Unicode code points from a DOMString by reference to the algorithm in WebIDL [1], which handles unpaired surrogates etc. This informative text should say Unicode code points rather than Unicode characters, though. Fixing now and referenced [1] even in the note. [1] http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
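The WebIDL conversion step referenced in [1] is easy to observe with the API that eventually shipped: an unpaired surrogate in a DOMString is replaced with U+FFFD before encoding, so the UTF-8 encoder itself never sees an error. A sketch:

```javascript
// "\uD800" is a lone high surrogate - not a valid Unicode scalar value.
const bytes = new TextEncoder().encode("\uD800");

// The DOMString-to-code-point conversion substitutes U+FFFD, whose
// UTF-8 serialization is 0xEF 0xBF 0xBD.
console.log(Array.from(bytes)); // [239, 191, 189]
```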
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote: On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote: On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings for while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-**encodings encode-to-utf[8|16] position. Any disagreement on limiting the supported encodings to utf-8, utf-16, and utf-16be, while permitting decoding of all encodings in the Encoding spec? (This eliminates the what to do on encoding error issue nicely, still need to resolve the BOM issue though.) http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. Jury is still out on the decode-with-BOM issue - I need to reason through Glenn's suggestions on the open issues thread. I added a related open issue raised by Glenn, summarized as ... suggest that the .encoding attribute simply return the name that was passed to the constructor. - taking this further, perhaps the attribute should be eliminated as callers could apply it themselves.
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote: On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings for while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-**encodings encode-to-utf[8|16] position. Any disagreement on limiting the supported encodings to utf-8, utf-16, and utf-16be, while permitting decoding of all encodings in the Encoding spec? (This eliminates the what to do on encoding error issue nicely, still need to resolve the BOM issue though.)
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 7, 2012 at 8:32 AM, Glenn Maynard gl...@zewt.org wrote: On Mon, Aug 6, 2012 at 11:39 PM, Jonas Sicking jo...@sicking.cc wrote: I seem to have a recollection that we discussed only allowing encoding to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats as well as stay in sync with other APIs like XMLHttpRequest. It looks like the relevant discussion was at http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html It doesn't appear we reached consensus - there was some desire expressed to scope to UTF-8, then perhaps expand to include UTF-16, definite consensus that any encoding supported should be handled by both encode and decode, then comments about XHR and form data encodings, but then the discussion wandered into stateful vs. stateless encodings which took us off topic. So Glenn's comment below pretty much reboots the conversation where it was: Not an objection, but where does XHR limit sent data to those encodings? send(FormData) forces UTF-8 (which is even more restrictive); send(Document) seems to allow any encoding *except* for UTF-16 (presumably web compat since that's a weird criteria). I'm not sure that staying in sync with XHR--which has its own pile of legacy code to support--is worthwhile here anyway, but limiting to Unicode seems fine in its own right, especially since the restriction can always be lifted later if real needs come up. However I currently can't find any restrictions on which target encodings are supported in the current drafts. When Anne's spec appeared I gutted mine and deferred wherever possible to his. One consequence of that was getting the other encodings for free as far as the spec writing goes. If we achieve consensus that we only want to support UTF encodings we can add the restrictions. There are use cases for supporting other encodings (parsing legacy data file formats, for example), but that could be deferred. 
One wrinkle in this is if we want to support arbitrary encodings when encoding, that means that we can't insert the replacement character as default error handling, since that isn't available in a lot of encoding formats. I don't think this part is a real hurdle. Just replace with "?" for non-Unicode encodings. On Tue, Aug 7, 2012 at 8:10 AM, Joshua Cranmer pidgeo...@verizon.net wrote: I found that the wiki version of the proposal cites http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to find encodings. That spec documents the encodings which are used anywhere in the platform, but that doesn't necessarily mean every API needs to support all those encodings. It's almost all backwards-compatibility. There are also cross-browser differences in handling decoding of certain code points in certain encodings. Exposing those encodings in a new API would either require that the browser vendors expose those differences (bleah) or implement a compatibility switch in the affected codecs (bleah).
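As it turned out, the API that eventually shipped went even further than the UTF-8/UTF-16/UTF-16BE restriction under discussion: TextEncoder dropped encoding labels entirely and only produces UTF-8 (its constructor takes no label, so one passed anyway is ignored under ordinary JS argument rules). A sketch of that end state:

```javascript
// The shipped TextEncoder constructor takes no label; a label passed
// anyway is silently ignored, and .encoding always reports "utf-8".
const enc = new TextEncoder("euc-kr");
console.log(enc.encoding); // "utf-8"

// encode() returns a Uint8Array of UTF-8 bytes.
const bytes = enc.encode("A");
console.log(bytes instanceof Uint8Array, bytes[0]); // true 65
```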
Re: [whatwg] StringEncoding: encode() return type looks weird in the IDL
On Sun, Aug 5, 2012 at 11:44 AM, Boris Zbarsky bzbar...@mit.edu wrote: On 8/5/12 1:39 PM, Glenn Maynard wrote: I didn't say it was extensibility, just a leftover from something that was either considered and dropped or forgotten about. Oh, I see. I thought you were talking about leaving the return value as-is so that Uint16Array return values can be added later. I'd vote for changing the return type to Uint8Array as things stand, and if we ever change what the function can return, we change the return type at that point. Thanks. Yes, having the return type be ArrayBufferView in the IDL is just a leftover. Fixing it now to be Uint8Array. I'll start another thread on StringEncoding shortly summarizing open issues, but anyone reading this thread is encouraged to take a look at http://wiki.whatwg.org/wiki/StringEncoding and craft opinions.
[whatwg] StringEncoding open issues
Regarding the API proposal at: http://wiki.whatwg.org/wiki/StringEncoding

It looks like we've got some developer interest in implementing this, and need to nail down the open issues. I encourage folks to look over the Resolved issues in the wiki page and make sure the resolutions - gathered from loose consensus here and offline discussion - are truly resolved, or if anything is not future-proof and should block implementations from proceeding. Also, look at the Notes to Implementers section; this should be non-controversial but may be non-obvious. This leaves two open issues: behavior on encoding error, and handling of Byte Order Marks (BOMs).

== Encoding Errors ==

The proposal builds on Anne's http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec, which defines when encodings should emit an encoder error. In that spec (which describes the existing behavior of Web browsers) encoders are used in a limited fashion, e.g. for encoding form results before submission via HTTP, and hence the cases are much more restricted than the errors encountered when browsers are asked to decode content from the wild. As noted, the encoding process could terminate when an error is emitted. Alternately (and as is necessary for forms, etc.) there is a use-case-specific escaping mechanism for non-encodable code points.

The proposed TextDecoder object takes a TextDecoderOptions dictionary with a |fatal| flag that controls the decode behavior in case of error - if |fatal| is unset (default) a decode error produces a fallback character (U+FFFD); if |fatal| is set then a DOMException is raised instead. No such option is currently proposed for the TextEncoder object; the proposal dictates that a DOMException is thrown if the encoder emits an error. I believe this is sufficient for V1, but want feedback. 
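The |fatal| flag described here survived essentially unchanged into the API that eventually shipped (though the exception ended up being a TypeError rather than a DOMException). A sketch of both behaviors:

```javascript
// 0xFF can never begin a valid UTF-8 sequence.
const bad = new Uint8Array([0xFF]);

// Default: a decode error produces the fallback character U+FFFD.
const lenient = new TextDecoder("utf-8").decode(bad);
console.log(lenient === "\uFFFD"); // true

// With fatal set, the same input raises an exception instead.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  threw = true;
}
console.log(threw); // true
```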
For V2 (or now, if desired), the API could be extended to accept an options object allowing for some/all of these cases:

* Don't throw; instead emit a standard/encoding-specific replacement character (e.g. '?')
* Don't throw; instead emit a fixed placeholder character (byte?) sequence
* Don't throw; instead call a user-defined callback and allow it to produce a replacement escaped character sequence, e.g. &#x...;

The latter seems the most flexible (superset of the rest) but is probably overkill for now. Since it can be added in easily later, can we defer until we have implementer and user feedback?

== Byte Order Marks (BOMs) ==

Once again, the proposal builds on Anne's http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec, which describes the existing behavior of Web browsers. In the wild, browsers deal with a variety of mechanisms for indicating the encoding of documents (server headers, meta tags, XML preludes, etc.), many of which are blatantly incorrect or contradictory. One form is fortunately rarely wrong - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes the byte order mark (the encoding-specific serialization of U+FEFF). This is built into the Encoding spec - given a byte sequence to decode and an encoding label, the label is ignored if the sequence starts with one of the three UTF BOMs, and the BOM-indicated encoding is used to decode the rest of the stream.

The proposed API will have different uses, so it is unclear that this is necessary or desirable. At a minimum, it is clear that:

* If one of the UTF encodings is specified AND the BOM matches, then the leading BOM character (U+FEFF) MUST NOT be emitted in the output character sequence (i.e. it is silently consumed)

Less clear is the behavior in these two cases:

* If one of the UTF encodings is specified AND a different BOM is present (e.g. UTF-16LE specified but a UTF-16BE BOM present)
* If one of the non-UTF encodings is specified AND a UTF BOM is present

Options include:

* Nothing special - the decoder does what it will with the bytes, possibly emitting garbage, possibly throwing
* Raise a DOMException
* Switch the decoder from the user-specified encoding to the BOM-specified encoding

The latter seems the most helpful when the proposed API is used as follows:

  var s = TextDecoder().decode(bytes); // handles UTF-8 w/o BOM and any UTF w/ BOM

... but it does seem a little weird when used like this:

  var d = TextDecoder('euc-jp');
  assert(d.encoding === 'euc-jp');
  var s = d.decode(new Uint8Array([0xFE]), {stream: true});
  assert(d.encoding === 'euc-jp');
  assert(s.length === 0); // can't emit anything until BOM is definitely passed
  s += d.decode(new Uint8Array([0xFF]), {stream: true});
  assert(d.encoding === 'utf-16be'); // really?
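Of the options listed, the shipped API ended up taking the first one for mismatching BOMs ("nothing special") and never switches decoders mid-stream; only a matching BOM is consumed. A sketch of both cases:

```javascript
// A UTF-16LE BOM fed to a UTF-8 decoder: just two invalid UTF-8 bytes,
// decoded as fallback characters (no throw, no decoder switch).
const utf16leBom = new Uint8Array([0xFF, 0xFE]);
const mismatched = new TextDecoder("utf-8").decode(utf16leBom);
console.log(mismatched === "\uFFFD\uFFFD"); // true

// A matching BOM is consumed and not emitted.
const matched = new TextDecoder("utf-16le")
  .decode(new Uint8Array([0xFF, 0xFE, 0x68, 0x00]));
console.log(matched); // "h"
```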
Re: [whatwg] StringEncoding: encode() return type looks weird in the IDL
On Sun, Aug 5, 2012 at 10:29 AM, Glenn Maynard gl...@zewt.org wrote: I guess the brokenness of Uint16Array (eg. the current lack of Uint16LEArray) could be sidestepped by just always returning Uint8Array, even if encoding to a 16-bit encoding (which is what it currently says to do). Maybe that's better anyway, since it avoids making UTF-16 a special case. +1 - which is why I pushed back on returning a Uint16Array earlier in the discussion. I guess that if you're converting a string to a UTF-16 ArrayBuffer, you're probably doing it to quickly dump it into a binary field somewhere anyway--if you wanted to *examine* the codepoints, you'd just look at the DOMString you started with. +1 again, and nicely stated. When I was a potential consumer of such an API, I was happy to treat the encoded form as a black box.
Re: [whatwg] binary encoding
On Tue, Jun 12, 2012 at 2:29 AM, Simon Pieters sim...@opera.com wrote: On Mon, 11 Jun 2012 18:20:55 +0200, Joshua Bell jsb...@chromium.org wrote: http://wiki.whatwg.org/wiki/StringEncoding defines a binary encoding (basically the official iso-8859-1 where it is not mapped to windows-1252). which is residue from earlier iterations. Intended use case was interop with legacy JS that used the lower 8 bits of strings to hold binary data, e.g. with APIs like atob()/btoa(). I think we should drop this and extend atob() and btoa() to be able to convert base64 strings to ArrayBuffer[View?] and back directly. Agreed (I wanted a little more consensus before removing it). Now that we can get binary data into script directly will there still be active use of base64 + ArrayBuffers that will benefit from platform support? Anyone want to tackle specifying the atob/btoa extensions? As a strawman:

  partial interface ArrayBufferView {
    DOMString toBase64();
  };
  partial interface ArrayBuffer {
    static ArrayBuffer fromBase64(DOMString string);
  };

These don't handle data streaming scenarios, however. (This is completely orthogonal to Anne's question about whether a binary encoding should be specified somewhere to describe current implementations.)
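Absent platform support along the lines of the strawman, the usual glue goes through a binary string with atob()/btoa(), roughly as sketched below (the function names here are illustrative, not a proposed API):

```javascript
// Encode a byte view to base64 via a binary string and btoa().
function bufferToBase64(view) {
  let binary = "";
  for (const byte of view) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Decode base64 back to bytes via atob().
function base64ToBuffer(b64) {
  const binary = atob(b64);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}

console.log(bufferToBase64(new Uint8Array([0x48, 0x69]))); // "SGk="
console.log(base64ToBuffer("SGk=")); // Uint8Array [72, 105]
```

Like the strawman, this does not handle streaming, and the intermediate binary string makes it memory-hungry for large buffers.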
Re: [whatwg] binary encoding
On Mon, Jun 11, 2012 at 6:03 AM, Anne van Kesteren ann...@annevk.nl wrote: http://wiki.whatwg.org/wiki/StringEncoding ... hasn't been getting much attention from me recently. I'll recap the open issues and proposed resolutions to this list soonish. defines a binary encoding (basically the official iso-8859-1 where it is not mapped to windows-1252). which is residue from earlier iterations. Intended use case was interop with legacy JS that used the lower 8 bits of strings to hold binary data, e.g. with APIs like atob()/btoa(). Is it an idea to move that to http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html somehow? On its own, this use case is probably not strong enough to merit slipping a pseudo-encoding into the platform, but... I do not think we want to give it an officially supported label, but it does make some sense to define it using the same infrastructure. http://dvcs.w3.org/hg/xhr/raw-file/tip/Overview.html has the same need for converting certain types of DOMString. ... as there are other use cases then we should codify it. I have no preferences as to label; the proposed JS API could specify a label for it, but defer the specifics of the encoding to the Encoding spec. (I believe as written I currently call out the special case that BOM detection should never be done for binary - which is already a special case - although BOM detection vis-a-vis the API is itself an open issue)
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
Any further input on Kenneth's suggestions? Re: ArrayBufferView vs. DataView - I'm tempted to make the switch to just DataView. As discussed below, data parsing/serialization operations will tend to be associated with DataViews. As Glenn has mentioned elsewhere recently, it is possible to accidentally do a buffer copy when mis-using typed array constructors, while DataView avoids this. DataViews are cheap to construct, and when I'm writing sample code for the proposed API I find I create throw-away DataViews anyway. Also, there is the potential for confusion when using a non-Uint8Array buffer e.g. are the elements being decoded using array[N] as the octets or using the underlying buffer? for Uint16Array/UTF-16 encodings, what are the endianness concerns? DataView APIs have an explicit endianness and no index getter, which alleviates this somewhat. Re: writing into an existing buffer - as Glenn says, most of the input earlier in the thread advocated strongly for very simple initial API with streaming support as the only fancy feature beyond the minimal string = foo.decode(buffer) / buffer = foo.encode(string). Adding details = foo.encodeInto(string, buffer) later on is not precluded if there is demand. Also, I am planning to move the fatal option from the encode/decode methods to the TextEncoder/TextDecoder constructors. Objections? On Tue, Mar 27, 2012 at 7:43 PM, Kenneth Russell k...@google.com wrote: On Tue, Mar 27, 2012 at 6:44 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote: - I think it should reference DataView directly rather than ArrayBufferView. 
The typed array spec was specifically designed with two use cases in mind: in-memory assembly of data to be sent to the graphics card or audio device, where the byte order must be that of the host architecture; This is wrong, broken, won't be implemented this way by any production browser, isn't how it's used in practice, and needs to be fixed in the spec. It violates the most basic web API requirement: interoperability. Please see earlier in the thread; the views affected by endianness need to be specced as little endian. That's what everyone is going to implement, and what everyone's pages are going to depend on, so it's what the spec needs to say. Separate types should be added for big-endian (eg. Int16BEArray). Thanks for your input. The design of the typed array classes was informed by requirements about how the OpenGL, and therefore WebGL, API work; and from prior experience with the design and implementation of Java's New I/O Buffer classes, which suffered from horrible performance pitfalls because of a design similar to that which you suggest. Production browsers already implement typed arrays with their current semantics. It is not possible to change them and have WebGL continue to function. I will go so far as to say that the semantics will not be changed. In the typed array specification, unlike Java's New I/O specification, the API was split between two use cases: in-memory data construction (for consumption by APIs like WebGL and Web Audio), and file and network I/O. The API was carefully designed to avoid roadblocks that would prevent maximum performance from being achieved for these use cases. Experience has shown that the moment an artificial performance barrier is imposed, it becomes impossible to build certain kinds of programs. I consider it unacceptable to prevent developers from achieving their goals. I also disagree that it should use DataView. Views are used to access arrays (including strings) within larger data structures. 
DataView is used to access packed data structures, where constructing a view for each variable in the struct is unwieldy. It might be useful to have a helper in DataView, but the core API should work on views. This is one point of view. The true design goal of DataView is to supply the primitives for fast file and network input/output, where the endianness is explicitly specified in the file format. Converting strings to and from binary encodings is obviously an operation associated with transfer of data to or from files or the network. According to this taxonomy, the string encoding and decoding operations should only be associated with DataView, and not the other typed array types, which are designed for in-memory data assembly for consumption by other hardware on the system. - It would be preferable if the encoding API had a way to avoid memory allocation, for example to encode into a passed-in DataView. This was an earlier design, and discussion led to it being removed as a premature optimization, to simplify the API. I'd recommend reading the rest of the thread. I do apologize for not being fully caught up on the thread, but hope that the input above was still useful. -Ken
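The endianness point made here can be seen directly: DataView takes an explicit byte-order flag on every access, whereas a multi-byte typed array view reads in the host's byte order, which is the interop concern raised earlier in the thread. A sketch:

```javascript
const bytes = new Uint8Array([0x34, 0x12]);
const dv = new DataView(bytes.buffer);

// DataView: endianness is explicit in every call.
console.log(dv.getUint16(0, true));  // 4660 (0x1234, little-endian)
console.log(dv.getUint16(0, false)); // 13330 (0x3412, big-endian)

// Uint16Array: reads in host byte order, with no index-level control.
const u16 = new Uint16Array(bytes.buffer);
console.log(u16[0]); // 0x1234 on a little-endian host
```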
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Sat, Mar 24, 2012 at 6:52 AM, Glenn Maynard gl...@zewt.org wrote: On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren ann...@opera.com wrote: Another way would be to have a second optional argument that indicates whether more bytes are coming (defaults to false), but I'm not sure of the chances that would be used correctly. The reasons you outline are probably why many browser implementations deal with EOF poorly too. It might not improve it, but I don't think it'd be worse. If you didn't use it correctly for an encoding where it matters, the breakage would be obvious. Also, the previous automatically-streaming API has another possible misuse: constructing a single encoder, then calling it repeatedly for unrelated strings, without calling eof() between them (trailing bytes would become U+FFFD in the next string). That'd be a less likely mistake with this, too. Agreed. Simple things should be simple. Here's a suggestion, working from that:

  encoder = Encoder("euc-kr");
  view = encoder.encode(str1, {continues: true});
  view = encoder.encode(str2, {continues: true});
  view = encoder.encode(str3, {continues: false});

An alternative way to end the stream:

  encoder = Encoder("euc-kr");
  view = encoder.encode(str1, {continues: true});
  view = encoder.encode(str2, {continues: true});
  view = encoder.encode(str3, {continues: true});
  view = encoder.encode("", {continues: false});
  // or
  view = encoder.encode(""); // equivalent; continues defaults to false
  // or
  view = encoder.encode(); // maybe equivalent, if the first parameter is optional

The simplest usage is concise enough that we don't really need a separate str.encode() method:

  view = Encoder("euc-kr").encode(str);

If it has an eof() method, it'd just be a literal wrapper for encoder.encode(), but it can probably be omitted. Agreed, I'd omit it. Bikeshed: The |continues| term doesn't completely thrill me; it's clear in context, but not necessarily what someone might go searching for. 
{eof: true} would be lovely except we want the default to be yes-EOF but a falsy JS value. |noEOF|? If there aren't immediate objections, I'll update my wiki draft with this style of API, and see about updating my JS polyfill as well. Opinions on one object type (Encoding) vs. two (Encoder, Decoder)? One object type is simpler for the non-streaming case, e.g.:

  // somewhere globally
  g_codec = Encoding("euc-kr");
  // elsewhere...
  str = g_codec.decode(view); // okay
  view = g_codec.encode(str); // fine, no state captured
  str = g_codec.decode(view); // still okay

but IMHO someone unfamiliar with the internals of encodings might extend the above into:

  // somewhere globally
  g_codec = Encoding("euc-kr");
  // elsewhere in some stream handling code...
  str = g_codec.decode(view, {continues: true}); // okay..
  view = g_codec.encode(str, {continues: true}); // sure, now both an encode and a decode state are captured by the codec
  str = g_codec.decode(view, {continues: true}); // okay only if this is more of the same stream; if there are two incoming streams, this is wrong

The same mistake is possible with Encoder / Decoder objects, of course (you just need two globals). But something about separating them makes it clearer to me that the |continues| flag is affecting state in the object rather than just affecting the output of the call.
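In the API that eventually shipped, this streaming flag landed on the decode side with the same "final call flushes" shape sketched above. A minimal sketch of that behavior, using the shipped |stream| name:

```javascript
const dec = new TextDecoder("utf-8");

// U+00E9 is 0xC3 0xA9 in UTF-8; split it across two chunks.
let s = dec.decode(new Uint8Array([0xC3]), { stream: true });
console.log(s.length); // 0 - the incomplete sequence is held back

// The flag defaults to false, so the final call flushes pending state.
s += dec.decode(new Uint8Array([0xA9]));
console.log(s); // "é"
```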
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote: On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org wrote: Bikeshed: The |continues| term doesn't completely thrill me; it's clear in context, but not necessarily what someone might go searching for. {eof:true} would be lovely except we want the default to be yes-EOF but a falsy JS value. |noEOF|? Peter Beverloo suggests stream on IRC. I like it. +1 Opinions on one object type (Encoding) vs. two (Encoder, Decoder)? Two seems cleaner. I've gone ahead and updated the wiki/draft: http://wiki.whatwg.org/wiki/StringEncoding This includes:

* TextEncoder / TextDecoder objects, with |encode| and |decode| methods that take option dicts
* A |stream| option, per the above
* A |nullTerminator| option that eliminates the need for a stringLength method (hasta la vista, baby!)
* The |encodedLength| method is dropped, since you can't in-place encode anyway
* Decoding errors yield fallback code points by default, but setting a |fatal| option causes a DOMException to be thrown instead
* Exceptions are specified as DOMException of type EncodingError, as a placeholder

New issues resulting from this refactor:

* You can change the options (stream, nullTerminator, fatal) midway through decoding a stream. This would be silly to do, but as written I don't think this makes the implementation more difficult. Alternately, the non-stream options could be set on the TextDecoder object itself.
* BOM handling needs to be resolved. The Encoding spec makes the encoding label secondary to the BOM. With this API it's unclear if that should be the case. Options include having a mismatching BOM throw, treating a mismatching BOM as a decoding error (i.e. fallback or throw, depending on options), or allowing the BOM to actually switch the decoder used for this stream - possibly if-and-only-if the default encoding was specified. 
I've also partially updated the JS polyfill proof-of-concept implementation, tests, and examples as well, but it does not implement streaming yet (i.e. a stream option is ignored, state is always lost); I need to do a tiny bit more refactoring first.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 4:12 PM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell jsb...@chromium.org wrote: * A |stream| option, per the above Does this make sense when you're using stream: false to flush the stream? It's still a streaming operation. I guess it's close enough. * A |nullTerminator| option eliminates the need for a stringLength method (hasta la vista, baby!) I strongly disagree with this change. It's much cleaner and more generic for the decoding algorithm to not know anything about null terminators, and to have separate general-purpose methods to determine the length of the string (memchr/wmemchr analogs, which we should have anyway). We made this simplification a long time ago--why did you resurrect this? Ah, I'd forgotten that there was consensus that doing this outside the API was preferable. I'll remove the option when I touch the spec again. * BOM handling needs to be resolved. The Encoding spec makes the encoding label secondary to the BOM. With this API it's unclear if that should be the case. Options include having a mismatching BOM throw, treating a mismatching BOM as a decoding error (i.e. fallback or throw, depending on options), or allow the BOM to actually switch the decoder used for this stream - possibly if-and-only-if the default encoding was specified. The path of fewest errors is probably to have a BOM override the specified UTF-16 endianness, so saying UTF-16BE just changes the default. This would apply only if the previous call had {stream: false} (implicitly or explicitly). Calling with {stream: false} would reset for the next call. Would it apply only to UTF-16 or to UTF-8 as well? Should there be any special behavior when not specifying an encoding in the constructor? On Mon, Mar 26, 2012 at 4:27 PM, Jonas Sicking jo...@sicking.cc wrote: A few comments: * It appears that we lost the ability to measure how long a resulting buffer was going to be and then decode into the buffer. 
I don't know if this is an issue. True. On the plus side, the examples in the page (encode/decode array-of-strings) didn't change size or IMHO readability at all. * It might be a performance problem to have to check for the fatal/nullTerminator options on each call. No comment here. Moving the fatal and other options to the TextDecoder object rather than the decode() call is a possibility. I'm not sure which I prefer. * We lost the ability to decode from an ArrayBuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. Agreed, but of course see above - there was consensus earlier in the thread that searching for null terminators should be done outside the API, therefore the caller will have the length handy already. Yes, this would be a big flaw since decoding a tightly packed data structure (e.g. an array of null-terminated strings w/o length) would be impossible with just the nullTerminator flag.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 6:24 PM, Glenn Maynard gl...@zewt.org wrote: I guess. It doesn't seem that important, since it's just a few lines of code. If this is done, I'd suggest that this helper API *not* have any special support for streaming (not to disallow it, but not to have any special handling for it, either). I think streaming has little overlap with null-terminated fields, since null-termination is typically used with fixed-size buffers. It would complicate things; for example, you'd need some way to signal to the caller that a null terminator was encountered. Agreed. Also worth relaying to this thread is that in addition to null termination there have been requests for other terminators, such as 0xFF, which is an invalid byte in a UTF-8 stream and thus a lovely terminator. Other byte sequences were mentioned. (This was over in the Khronos WebGL list for anyone who wants to dig it up. It was tracked as an unresolved ISSUE in the spec.) This supports the assertion that we should not special-case null terminators, but instead provide general (and highly optimizable) utilities like memchr operating on buffers, since we can't anticipate every usage in higher-level APIs like the one under discussion.
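The conclusion here - keep terminator handling outside the decode API - looks like this in practice with the API that eventually shipped, using TypedArray indexOf as the memchr analog:

```javascript
// "hi\0" followed by padding in a fixed-size field.
const field = new Uint8Array([0x68, 0x69, 0x00, 0xFF, 0xFF]);

// memchr analog: find the terminator ourselves...
const end = field.indexOf(0);

// ...then decode only the bytes before it (whole field if none found).
const s = new TextDecoder()
  .decode(field.subarray(0, end === -1 ? field.length : end));
console.log(s); // "hi"
```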
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 21, 2012 at 12:42 PM, Anne van Kesteren ann...@opera.com wrote: On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc wrote: This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it. For XMLHttpRequest it might be, yes. I think we should expose the same encoding set throughout the platform. One reason to limit the encoding set initially might be because we have not all converged yet on our encoding sets. Gecko, Safari, and Internet Explorer expose a lot more encodings than Opera and Chrome. Just to throw it out there - does anyone feel we can/should offer asymmetric encode/decode support, i.e. supporting more encodings for decode operations than for encode operations? As for the API, how about:

  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine

And similarly you would have:

  dec = new Decoder("shift_jis")
  bytes = dec.decode(string)

Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both:

  enc = new Encoding("gb18030")
  bytes1 = enc.decode(string1)
  string2 = enc.encode(bytes2)

That's the direction my thinking was headed. Glenn pointed out that the state that's implicitly captured in the above objects could instead be returned as an explicit but opaque state object that's passed in and out of stateless functions. As a potential user of the API, I find the above object-oriented style easier to understand. Re: Encoding object vs. an Encoder/Decoder pair - I'd prefer the latter, as it makes the state being captured and any methods/attributes to interrogate the state clearer. Bikeshedding on the name - we'd have to put String or Text in there somewhere, since audio/video/image codecs will likely want to use similar terms.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote: Why are encodings different than other parts of the API where you indeed have to know what works and what doesn't. Do you memorize lists of encodings? I certainly don't. I look them up as needed. UTF-8 is stateful, so I disagree. No, UTF-8 doesn't require a stateful decoder to support streaming. You decode up to the last codepoint that you can decode completely. The return values are the output data, the number of bytes output, and the number of bytes consumed; that's all you need to restart decoding later. That's the iconv(3) approach that we're probably all familiar with, which works with almost all encodings. ISO-2022 encodings are stateful: you have to persistently remember the character subsets activated by earlier escape sequences. An iconv-like streaming API is impossible; to support streamed decoding, you'd need to have a decoder object that the user keeps around in order to store that state. http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure Which seems like it leaves us with these options: 1. Only support encodings with stateless coding (possibly down to a minimum of UTF-8) 2. Only provide an API supporting non-streaming coding (i.e. whole strings/whole buffers) 3. Expand the API to return encoder/decoder objects that capture state Any others? Trying to simplify the problem by taking on both (1) and (2) without (3) would lead to an API that could not encompass (3) in the future, which would be a mistake. I'll throw out that the in-progress design of a Globalization API for ECMAScript - http://norbertlindenberg.com/2012/02/ecmascript-internationalization-api/ - is currently spec'd to both build on the existing locale-aware methods on String/Number/Date prototypes as conveniences, as well as introducing the Collator and *Format objects.
Should we start with UTF-8-only/non-streaming methods on DOMString/ArrayBufferView, and avoid constraining a future API supporting multiple, possibly stateful encodings and streaming?
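[The stateless iconv(3)-style approach argued for above can be made concrete: given an arbitrary UTF-8 chunk, find how many leading bytes form complete sequences, decode those, and let the caller carry the remainder into the next chunk. The function below is an illustrative sketch, not any proposed API surface; it only locates the safe boundary.]

```javascript
// Return the number of leading bytes of `bytes` that form complete
// UTF-8 sequences; the caller re-prepends the rest to the next chunk.
function completeUtf8Prefix(bytes) {
  const end = bytes.length;
  let i = end;
  // Walk back past at most 3 trailing continuation bytes (0b10xxxxxx).
  while (i > 0 && (bytes[i - 1] & 0xc0) === 0x80 && end - i < 3) i--;
  if (i === 0) return end; // degenerate input; let the decoder cope
  const lead = bytes[i - 1];
  const needed =
    lead >= 0xf0 ? 4 : lead >= 0xe0 ? 3 : lead >= 0xc0 ? 2 : 1;
  // If the final sequence is short of `needed` bytes, stop before it.
  return end - (i - 1) < needed ? i - 1 : end;
}

// 0x68 = "h"; 0xC3 is the lead byte of a 2-byte sequence, cut short.
completeUtf8Prefix(new Uint8Array([0x68, 0xc3])); // 1 -- only "h" is safe
```

[No decoder object is needed: the byte offset returned here is the entire "state", which is exactly the distinction drawn above between UTF-8 and ISO-2022.]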
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote: On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote: What's the use-case for the stringLength function? You can't decode into an existing datastructure anyway, so you're ultimately forced to call decode at which point the stringLength function hasn't helped you. stringLength doesn't return the length of the decoded string. It returns the byte offset of the first \0 (or the length of the whole buffer, if none), for decoding null-terminated strings. For multibyte encodings (e.g. everything except UTF-16 and friends), it's just memchr(), so it's much faster than actually decoding the string. And just to be clear, the use case is decoding data formats where string fields are variable-length and null-terminated. Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function. I suggested e.g. result = encode(string, "utf-8", null).output; which would create an ArrayBuffer of the required size. Presumably the null ArrayBufferView argument would be optional, so you could just say encode(string, "utf-8"). I think we want both encoding and destination to be optional. That leads us to an API like: out_dict = stringEncoding.encode(string, opt_dict); .. where both out_dict and opt_dict are WebIDL Dictionaries: opt_dict keys: view, encoding out_dict keys: charactersWritten, bytesWritten, output ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??) If this instead is attached to String, it would look like: out_dict = my_string.encode(opt_dict); If it were attached to ArrayBufferView, having a right-size buffer allocated for the caller gets uglier unless we include a static version. It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. 
The implementation seems required both to check that the data can be decoded using the specified encoding, as well as check that the data will fit in the passed-in buffer. Only then can the implementation start decoding the data. This seems problematic. Only if it guarantees that it doesn't write anything to the output buffer unless the entire result will fit. I don't think we need to do that; just guarantee that it'll be truncated on a whole codepoint. Agreed. Input/output dicts mean the API documentation a caller needs to read to understand the usage is more complex than a function signature which is why I resisted them, but it does seem like the best approach. Thanks for pushing, Glenn! In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over-allocating or reallocating/moving. I also don't think it's a good idea to throw an exception for encoding errors. Better to convert characters to the Unicode replacement character. I believe we made a similar change to the WebSockets specification recently. Was that change made? I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to be undecided. Settling on an options dict means adding a flag to control this behavior (throws: true ?) doesn't extend the API surface significantly.
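[The encode-into-a-caller-supplied-view design discussed above, including truncation on a whole-codepoint boundary, is very close to what later shipped as TextEncoder.encodeInto() (UTF-8 only). A sketch of that shipped shape, for comparison with the out_dict proposal:]

```javascript
// encodeInto() writes into a caller-supplied Uint8Array and reports
// { read, written }: code units consumed and bytes emitted. When the
// target is too small, output is truncated on a whole codepoint.
const enc = new TextEncoder(); // always UTF-8
const target = new Uint8Array(4);

// "h\u00e9llo" needs 6 bytes ("\u00e9" is 2); only "h\u00e9l" fits in 4.
const { read, written } = enc.encodeInto("h\u00e9llo", target);
// read === 3 (code units consumed), written === 4 (bytes emitted)
```

[Note the shipped API dropped the `encoding` option and the separate `output` key from the dictionary sketched in this thread, but kept the two counters under the names `read` and `written`.]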
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote: And just to be clear, the use case is decoding data formats where string fields are variable length null terminated. ... and the spec should include normative guidance that length-prefixing is strongly recommended for new data formats.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 10:35 AM, Glenn Maynard gl...@zewt.org wrote: On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote: ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??) Uint8Array is correct. (Uint8ClampedArray is for image color data.) If UTF-16 or UTF-32 are supported, decoding to them should return Uint16Array and Uint32Array, respectively (with the return value being typed just to ArrayBufferView). FYI, there was some follow-up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness - the above would imply that either platform endianness dictated the output byte sequence (and le/be was ignored), or that encode("\uFFFD", "utf-16").view[0] might != 0xFFFD on some platforms. There was consensus (among the two of us) that the output view's underlying buffer's byte order would be le/be depending on the selected encoding. There is no consensus over what the return view type should be - Uint8Array, or pursue BE/LE variants of Uint16Array to conceal platform endianness.
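[The endianness point above can be demonstrated with DataView, which takes an explicit byte-order flag per access and so produces the same byte sequence on every platform, unlike a raw Uint16Array view:]

```javascript
// Pin down the byte order explicitly, independent of the platform.
const buf = new ArrayBuffer(2);
const view = new DataView(buf);
view.setUint16(0, 0xfffd, true); // little-endian: bytes FD FF

const bytes = new Uint8Array(buf);
// bytes[0] === 0xfd and bytes[1] === 0xff on every platform;
// new Uint16Array(buf)[0] would instead depend on host endianness.
```

[This is why the resolution discussed above (encoder output bytes follow the selected "utf-16le"/"utf-16be" encoding, with the return type being a byte view) sidesteps the platform-endianness problem entirely.]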
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding * Rewritten in terms of Anne's Encoding spec and WebIDL, for algorithms, encodings, and encoding selection, which greatly simplifies the spec. This implicitly adds support for all of the other encodings defined therein - we may still want to dictate a subset of encodings. A few minor issues noted throughout the spec. * Define a binary encoding, since that support was already in this spec. We may decide to kill this but I didn't want to remove it just yet. * Simplify methods to take ArrayBufferView instead of any/byteOffset/byteLength. The implication is that you may need to use temporary DataViews, and this is reflected in the examples. * Call out more of the big open issues raised on this thread (e.g. where should we hang this API) Nothing controversial added, or (alas) resolved.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 14, 2012 at 3:53 PM, Glenn Maynard gl...@zewt.org wrote: It's more than a naming problem. With this string API, one side of the conversion is always a DOMString. Base64 conversion wants ArrayBuffer-ArrayBuffer conversions, so it would belong in a separate API. Huh. The scenarios I've run across are Base64-encoded binary data islands embedded in textual container formats like XML or JSON, which yield a DOMString I want to decode into an ArrayBuffer.
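[The data-island scenario described above (Base64 text pulled out of JSON or XML, decoded into binary) can be sketched with the atob() primitive discussed elsewhere in this archive. atob() returns a "binary string" in which each char code is one byte, so one copy loop yields a Uint8Array; the helper name is illustrative:]

```javascript
// Decode a Base64 data island (a DOMString) into a Uint8Array.
function base64ToBytes(b64) {
  const bin = atob(b64); // binary string: one byte per char code
  const out = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) out[i] = bin.charCodeAt(i);
  return out;
}

base64ToBytes("AQID"); // Uint8Array of [1, 2, 3]
```

[This is exactly the DOMString-to-ArrayBuffer direction Joshua describes, which is why Base64 fits awkwardly into an API whose conversions always have a DOMString on one side only.]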
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). There was discussion about this before: https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html http://wiki.whatwg.org/wiki/StringEncoding (I don't know why it was on the WebGL list; typed arrays are becoming infrastructural and this doesn't seem like it belongs there, even though ArrayBuffer was started there.) The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html ), conversion errors should use replacement (e.g. U+FFFD), not throw exceptions. The "any" arguments should be fixed. Encoding to UTF-16 should definitely not prefix a BOM, and UTF-16 having unspecified endianness is obviously bad. I'd also suggest that, unless there's serious, substantiated demand for it--which I doubt--only major Unicode encodings be supported. Don't make it easier for people to keep using legacy encodings. Two other pieces of feedback I received from Adam Barth off list: * take ArrayBufferView as input, which both fixes "any" and simplifies the API to eliminate byteOffset and byteLength * support two versions of encode, one which takes a target ArrayBufferView, and one which allocates/returns a new Uint8Array of the appropriate length. Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? I don't think so, because retrieving the N'th decoded/reencoded character isn't a constant-time operation. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). There was discussion about this before: https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html http://wiki.whatwg.org/wiki/StringEncoding (I don't know why it was on the WebGL list; typed arrays are becoming infrastructural and this doesn't seem like it belongs there, even though ArrayBuffer was started there.) Purely historical; early adopters of Typed Arrays were folks prototyping with WebGL who wanted to parse data files containing strings. WHATWG makes sense, I just hadn't gotten around to shopping for a home. (Administrivia: Is there need to propose a charter addition?) The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html ), conversion errors should use replacement (e.g. U+FFFD), not throw exceptions. The "any" arguments should be fixed. Encoding to UTF-16 should definitely not prefix a BOM, and UTF-16 having unspecified endianness is obviously bad. I'd also suggest that, unless there's serious, substantiated demand for it--which I doubt--only major Unicode encodings be supported. Don't make it easier for people to keep using legacy encodings. Two other pieces of feedback I received from Adam Barth off list: * take ArrayBufferView as input, which both fixes "any" and simplifies the API to eliminate byteOffset and byteLength * support two versions of encode, one which takes a target ArrayBufferView, and one which allocates/returns a new Uint8Array of the appropriate length. 
Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? I don't think so, because retrieving the N'th decoded/reencoded character isn't a constant-time operation. -- Glenn Maynard
Re: [whatwg] Behavior when script is removed from DOM
On Wed, Dec 7, 2011 at 12:01 PM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Dec 7, 2011 at 11:27 AM, Adam van den Hoven a...@littlefyr.com wrote: On Sat, Dec 3, 2011 at 9:17 PM, Jonas Sicking jo...@sicking.cc wrote: On Sat, Dec 3, 2011 at 7:38 PM, Yehuda Katz wyc...@gmail.com wrote: Yehuda Katz (ph) 718.877.1325 On Sat, Dec 3, 2011 at 6:37 PM, Jonas Sicking jo...@sicking.cc wrote: On Sat, Dec 3, 2011 at 6:24 PM, Yehuda Katz wyc...@gmail.com wrote: Yehuda Katz (ph) 718.877.1325 On Fri, Dec 2, 2011 at 11:30 AM, Tab Atkins Jr. jackalm...@gmail.com wrote: On Fri, Dec 2, 2011 at 11:27 AM, Jonas Sicking jo...@sicking.cc wrote: The main use case for wanting to support scripts getting removed appears to be wanting to abort JSONP loads. Potentially to reissue them with new parameters. This is a decent use case, but given the raciness described above in WebKit, it doesn't seem like a reliable technique in existing browsers. If it's unreliable *and* no sites appear to break with the proper behavior, we shouldn't care about this use-case, since cross-domain XHR solves it properly. Cross-domain XHR *can* solve this use case, but the fact is that CORS is harder to implement than JSONP, and so we continue to have a large number of web APIs that support JSONP but not CORS. Unfortunately, I do not foresee this changing in the near future. I think we can solve this in 3 ways: 1. Keep spec as it is. Pages can simply ignore the JSONP callback when it happens. Disadvantages: Additional bandwidth. More complexity for the web page. 2. Make removing scripts cancel any execution. Disadvantages: Pages will have to deal with the fact that removing scripts can still cause the callback to happen if the load just finished. So the same amount of complexity for page authors that don't want buggy pages as alternative 1. Since many pages likely won't properly handle the callback happening anyway, this will likely cause pages to be buggy in contemporary browsers. 3. 
Add a new API to reliably cancel a script load. Disadvantages: New API for pages to learn. 4. Add a new API (or customize XHR) to explicitly support JSONP requests, and allow those requests to be cancelled. Yes, that's definitely an option. It will be sort of a weird API since the security model will be sort of strange. Traditionally we say that you can't load data cross site, but that you can execute scripts cross site. Here we want something sort of in between. It could have significant advantages if it makes it easier for sites to do cross-site loading of data without exposing themselves to XSS risks. / Jonas If we went for a hybrid approach, namely that XHR has a cancellable way to call and execute some arbitrary JavaScript and sandbox the execution so that this is something explicitly provided to the XHR, would we not suddenly have a rather secure way to load any javascript in general (and probably make things like lab.js and yepnope easier to write)? Now I can load some javascript (say from some ad server) without giving it access to the window object and the global scope, if I don't want to. Wouldn't this address some of the security issues that Doug Crockford has brought up in the past? Yeah. This would be very cool. Proposals more than welcome, though I would suggest not tying it to XHR but rather have a dedicated "load and execute this URL in this sandbox" API. Designing a sandbox API is likely a fairly large task. I believe that ES.next might have something to that extent but I'm not fully sure. Yeah, the modules proposal for ES harmony is fairly similar: http://wiki.ecmascript.org/doku.php?id=harmony:modules The relevant bits for this thread are that a script can be loaded into a new pristine global environment (i.e. 
it doesn't just get to party on window, is shielded from any prior monkeying with Object.prototype, etc) and decides what to export (by applying properties to its global object); the script doing the import can decide what to pick up from the global object of the imported module. This can't be implemented in JS today (e.g. as a shim) since that "evaluate this script text in this new global sandbox" bit isn't present. A dedicated JSONP API is likely a lot simpler to design and could be specced and rolled out quicker. But of course has a smaller feature set. / Jonas
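[Option 1 above ("pages can simply ignore the JSONP callback when it happens") is the pattern JSONP libraries actually use: cancellation swaps the registered callback for a no-op, so a late-arriving response is dropped rather than aborted. A minimal sketch; all names are illustrative, and in a real page jsonp() would append a script element whose response invokes callbacks[name], rather than the test harness calling it directly:]

```javascript
// A JSONP callback registry where "cancel" means "ignore".
const callbacks = Object.create(null);
let nextId = 0;

function jsonp(options) {
  const name = "cb" + nextId++;
  callbacks[name] = options.onData;
  // In a browser: create a <script src="...?callback=" + name> here.
  return {
    name,
    cancel() { callbacks[name] = () => {}; }, // drop any late response
  };
}

let received = null;
const req = jsonp({ onData: (data) => { received = data; } });
req.cancel();
callbacks[req.name]({ ok: true }); // the response arrives after cancel
// received is still null: the payload was silently discarded
```

[This demonstrates the trade-off Jonas lists: no new platform API is needed, but every page carries the "ignore the callback" complexity itself, and the bandwidth for the unwanted response is still spent.]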
Re: [whatwg] Specs for window.atob() and window.btoa()
On Sat, Feb 5, 2011 at 6:37 PM, Joshua Cranmer pidgeo...@verizon.net wrote: On 02/05/2011 08:29 PM, Jonas Sicking wrote: So my first question is, can someone give examples of sources of base64 data which contains whitespace? The best guess I have is base64-encoding MIME parts, which would be hard-wrapped every 70-80 characters or so. RFC 3548, "The Base16, Base32, and Base64 Data Encodings", Section 2.1 discusses line feeds in encoded data, calling out the MIME line length limit. For example, Perl's MIME::Base64 has an encode_base64() API that by default inserts newlines after 76 characters. (An optional argument allows this behavior to be overridden.) Section 2.3 discusses "Interpretation of non-alphabet characters in encoded data", specifically in base64 (etc.) encoded data. -- Josh
Re: [whatwg] Specs for window.atob() and window.btoa()
On Fri, Jan 7, 2011 at 9:27 AM, Aryeh Gregor simetrical+...@gmail.com wrote: On Fri, Jan 7, 2011 at 12:01 AM, Boris Zbarsky bzbar...@mit.edu wrote: Note that it's not that uncommon to use atob on things that came from other base64-producing tools, not just from btoa. Not sure whether that matters here. I don't think it does. I don't think any base64 encoding implementation is likely to pad input strings' lengths to a multiple of six bits using anything other than zero bits. So it's mostly just a matter of specification and testing simplicity. It might not hurt to include an *informative* note in the specification that some base64-encoding tools and APIs by default inject whitespace into any base64-encoded data they output; for example, line breaks after 76 characters. Therefore, defensively written programs that use window.atob should consider the use of something akin to: var output = window.atob(input.replace(/\s+/g, "")); Again, this would be informative only; rejection of input strings containing whitespace is already implicitly covered by your normative text.
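[The defensive wrapper suggested above, as a complete sketch. (Note that the spec as eventually written resolved this the other way: the "forgiving-base64 decode" algorithm strips ASCII whitespace itself, so modern atob() tolerates it. The wrapper remains harmless, and necessary for implementations of the stricter behavior discussed in this thread.)]

```javascript
// Strip whitespace before decoding, since MIME-style encoders
// hard-wrap their Base64 output (e.g. every 76 characters).
function atobLenient(input) {
  return atob(input.replace(/\s+/g, ""));
}

// "hello world" encoded, with a line break injected mid-stream.
const wrapped = "aGVsbG8g\nd29ybGQ=";
atobLenient(wrapped); // "hello world"
```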