Re: [whatwg] Accessing local files with JavaScript portably and securely

2017-04-19 Thread Joshua Bell
On Wed, Apr 19, 2017 at 8:23 AM, Roger Hågensen 
wrote:

> On 2017-04-19 11:28, Anne van Kesteren wrote:
>
>> I already pointed to https://wicg.github.io/entries-api/ as a way to
>> get access to a directory of files and <input type=file multiple> as a way to
>> get access to a sequence of files. Both for read access. I haven't
>> seen any interest to go beyond that.
>>
>
> Is this the FileSystem & FileWriter API?
>

A small subset of the functionality specified in FileSystem was used by
Chrome to expose directory upload. The subset necessary for interop of
directory upload has since been implemented by Firefox and Edge. I put up
the entries-api spec to try to re-specify just that subset. (It's a
work in progress.)
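
For illustration, a minimal sketch of consuming that subset from a drop
event (API names per the entries-api draft; "dropZone" is an assumed
element, and real code would call readEntries() repeatedly until it returns
an empty array, plus handle errors):

dropZone.addEventListener("drop", function (e) {
  e.preventDefault();
  var entry = e.dataTransfer.items[0].webkitGetAsEntry();
  if (entry && entry.isDirectory) {
    entry.createReader().readEntries(function (entries) {
      entries.forEach(function (child) {
        if (child.isFile)
          child.file(function (file) { console.log(file.name, file.size); });
      });
    });
  }
});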


> This was added to Chrome/Opera under the webkit prefix 7 years ago; Edge
> and Firefox have not picked this up yet (just the Reader part).
> (as shown by http://caniuse.com/#search=file )
>

The market apparently demonstrates that a sandboxed file system storage API
isn't high priority for browser vendors to implement.


>
> I avoid prefixed features, and try to use only features that the latest
> Edge/Chrome/Firefox support, so that end users are less likely to end up
> in a situation where their browser does not support an app.
>
> And unless I remember wrongly, Firefox did support this at some point and
> then removed it again.
>
>
> Take for example my soundbank app.
>
> An end user would want to either use a file selector or drag'n'drop to the
> app (browser) window to add files to the soundboard.
>
> Let us assume that 30+ sounds are added (I don't even think the file
> requester handles multi-selection properly in all browsers today)
>
> Would it be fair to expect the user to have to re-add these each time they
> start/open the app? During a week that is a lot of pointless work.
> Saving filenames is not practical, and even if it was there would be no
> paths.
>
> And storing the sounds in IndexedDB or localStorage is out of the question
> as that is limited to a total of 5MB or even less in most browsers; 30+
> samples easily consume that.
>

You may want to check again. An origin typically gets an order of magnitude
more storage than that for Indexed DB across browsers and devices.
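
To make that concrete for the soundboard case, here's a sketch (database and
store names are made up) of persisting dropped audio files in Indexed DB -
Blobs/Files can be stored directly via structured clone:

var open = indexedDB.open("soundboard", 1);
open.onupgradeneeded = function () {
  open.result.createObjectStore("sounds");
};
open.onsuccess = function () {
  var db = open.result;
  // file is a File from a drop event or file selector
  function saveSound(file) {
    db.transaction("sounds", "readwrite")
      .objectStore("sounds")
      .put(file, file.name);
  }
};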

>
> The ideal here is to make a html soundboard app locally (i.e file://) then
> copy it as is to a webserver. Users can either use it from there (http://
> or https:// online and/or offline) or "Save As" the document and use it
> locally (file://) for preservation or offline use without server dependency.
>
> The only way to make this work currently is to make the user hand-write
> the path (full or relative) to each sound and store that in localStorage
> along with volume and fade in/out.
> But fade in and out is "faked" by adjusting the <audio> volume, as CORS
> prevents processing the audio and doing a proper crossfade between sounds
> - which is possible, but locked down due to CORS.
>
> I can understand limitations due to security concerns, but arbitrary
> limitations to functionality baffles me.
>
> I do not see much difference between file:// and http(s):// besides one
> allowing serverside data processing and http headers, but these days most
> apps are entirely clientside. A sample editor can be written that is fully
> clientside, even including mic recording and normalizing, FX; the server is
> not involved at any stage except delivering the .html file + a few lines of
> headers. The web app itself is identical (i.e. hash/checksum identical) be
> it http(s): or file:
>
> The benefit is that "the app is the source code", which is an ideal goal of
> open source as anyone can review and copy and modify as they please.
> And in theory it could run just as well truly offline/standalone as it
> could online without the need for a local webserver or similar.
>
> I'd dare say that thinking of a web app as something hosted only from a
> server via http(s) is an antiquated idea.
> These days a "web" app can be hosted via anything; want to open a webapp
> that is served from cloud storage like Dropbox? Not a problem.
> Well, almost not a problem: a cloud storage provider probably does not have
> the proper CORS headers to allow a sample editor to process sound from local
> files or files stored on a different cloud service.
>
> And a soundboard or a sample editor are just two examples; an image or video
> editor would have similar issues. Or what about a game with mod support?
> Being able to drag'n'drop a mod onto a game and then have the game load it
> the next time you start the game would be a huge benefit.
> But currently this cannot be done; the mod would have to be uploaded to
> the server the game is served from, even if the game itself does not use or
> need any serverside scripting.
>
> Or imagine a medical app that needs to read in CSV data; such an app could
> work fully offline/local and load up the data each time it's started.
> Storing the data in localStorage/IndexedDB would be limited by whatever else
> is stored as far as size

Re: [whatwg] Persistent and temporary storage

2015-03-16 Thread Joshua Bell
On Mon, Mar 16, 2015 at 1:38 AM, Anne van Kesteren ann...@annevk.nl wrote:

 On Fri, Mar 13, 2015 at 5:06 PM, Joshua Bell jsb...@chromium.org wrote:
  A handful of us working on Chrome have been having similar discussions
  around what we've been calling "durable storage". In its simplest model it
  is a bit granted by the user to an origin, which then requires explicit
  user action before the data might be cleared under storage pressure, so it
  sounds like our thinking is broadly aligned, although we're still
  exploring various possibilities and their implications for permission
  prompts, cleanup UI, behavior under pressure, etc.

 Yeah, same here; the wiki page outlines a tentative plan.


Gotcha. And thanks again for opening up this discussion!


  Similarly, we've been trying to keep this orthogonal from quota (either
  the UA's logic for assigning a quota to an origin, or possible
  standardized quota APIs), although the UA may use similar signals for
  granting permissions/assigning quota.

 I think we've come around in that we need to expose quota in some way
 to give developers some expectation as to how much they can fetch and
 then store in best effort mode.


I think that matches our latest discussions too...


 But that for persistent it can be
 the whole disk.


... and we're waffling on that one. Going that far implies that the UA does
a really good job, on its own or with user interaction, of responding when
storage is indeed getting full. Mobile OSes typically provide UI to inspect
how much storage is in use and clear apps and/or portions of their storage.
IMHO, we need to fully develop that UX in the UA before I'd be comfortable
letting sites easily consume the whole disk.

But we realize that artificially capping disk usage is a gap between web
and native, and so solving that problem is high priority for us. And I
don't think there are spec/standards implications here so we can move fast
on the UA side, as long as we spec that QuotaExceededError can happen on
various operations regardless of permissions, because even unlimited quota
can be constrained by physical limits.
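
For example, Indexed DB surfaces this as an aborted transaction, which a
page can detect and react to (a sketch; the store name and variables are
hypothetical):

var tx = db.transaction("cache", "readwrite");
tx.objectStore("cache").put(bigBlob, key);
tx.onabort = function () {
  if (tx.error && tx.error.name === "QuotaExceededError") {
    // evict something and retry, or tell the user storage is full
  }
};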

 (FYI, we've been using "durable" and "non-durable" to distance the
  discussion from the now-loaded "temporary" vs. "persistent" terms which
  surfaced in earlier API proposals, some of which are implemented in
 Chrome)

 Ah right. Current set of terms I have is "best effort" (default; fixed
 quota), "persistent" (requires some kind of user opt-in, probably
 through an API-triggered dialog, but maybe also done if you pin a tab
 or bookmark or some such; 'unlimited' quota), and "temporary" (exists
 outside of best effort/persistent, e.g. for storing social network
 resources, other volatile assets, requires some kind of API opt-in;
 fixed quota).


If I'm reading the wiki page correctly, I'm intrigued by the "temporary"
proposal. To confirm: you're envisioning a completely new lightweight
storage API, and there's no implied addition to the other storage APIs? If
so... well, pros and cons. I'm not a huge fan of adding Yet Another Storage
API. On the other hand, I'd rather do that than fork the existing storage
APIs into temp/persistent and try to shoehorn priorities into those.

If it helps, I did a thought experiment a while ago on "what would a
stripped-down, Promise-based IDB-lite look like?" at
https://gist.github.com/inexorabletash/c8069c042b734519680c - it doesn't
have the priority scheme, but that would be easy to add at the 'open' entry
point.

...

One thing we should discuss under the storage umbrella is how atomically we
treat all storage for an origin. Customers we've talked to acknowledge the
reality that even durable storage can be wiped in the face of user action
(e.g. via settings UI to clear cookies etc) or file corruption. One of the
situations they're concerned about is dealing with partial clearing of
data, e.g. Indexed DB databases are present but the SW cache has been
wiped, or vice versa. Currently, for quota-based storage eviction, we evict
an origin's entire storage at once - that's easiest for sites to reason
about, since it matches the "first time user" or "returning user on new
device" scenarios that must already be supported. If we're taking a step
back to think of storage as a whole, we may want to provide more spec-level
assurance in this area.



 --
 https://annevankesteren.nl/



Re: [whatwg] Persistent and temporary storage

2015-03-13 Thread Joshua Bell
Very timely!

A handful of us working on Chrome have been having similar discussions
around what we've been calling "durable storage". In its simplest model it
is a bit granted by the user to an origin, which then requires explicit
user action before the data might be cleared under storage pressure, so it
sounds like our thinking is broadly aligned, although we're still exploring
various possibilities and their implications for permission prompts,
cleanup UI, behavior under pressure, etc.

Similarly, we've been trying to keep this orthogonal from quota (either the
UA's logic for assigning a quota to an origin, or possible
standardized quota APIs), although the UA may use similar signals for
granting permissions/assigning quota.

(FYI, we've been using "durable" and "non-durable" to distance the
discussion from the now-loaded "temporary" vs. "persistent" terms which
surfaced in earlier API proposals, some of which are implemented in Chrome)

On Fri, Mar 13, 2015 at 7:25 AM, Janusz Majnert j.majn...@samsung.com
wrote:


 On 13.03.2015 15:01, Anne van Kesteren wrote:

 On Fri, Mar 13, 2015 at 2:58 PM, Janusz Majnert j.majn...@samsung.com
 wrote:

 The real question is: why is having a quota useful?


 The reason developers want it is to know how much they can download
 and store without getting an exception.


 Which still doesn't guarantee they won't get an exception if the device
 runs out of space for whatever reason.


  Native apps are not
 controlled when it comes to storing data and nobody complains.


 Is there any documentation on how they handle the above scenario? Just
 write to disk until you hit failure?


 I think so. This is certainly the case with desktop apps. I also didn't
 find any mention of quota in the Android download manager docs (
 http://developer.android.com/reference/android/app/DownloadManager.html)
 or in Tizen's Download API (https://developer.tizen.org/dev-guide/2.3.0/org.tizen.mobile.native.apireference/group__CAPI__WEB__DOWNLOAD__MODULE.html)



 Regards,
 --
 Janusz Majnert
 Senior Software Engineer
 Samsung R&D Institute Poland
 Samsung Electronics



[whatwg] AppCache error event details

2014-03-24 Thread Joshua Bell
We'd like to move forward on adding event details to AppCache errors. [1]

I've fleshed out a proposal [2] that details the additions to the HTML
spec. It introduces a new event type (ApplicationCacheErrorEvent) with
reason (an enum), url, status, and message fields.

Feedback would be appreciated, particularly about what level of information
is safe to expose about cross-origin resource fetches, which Chrome does
support with AppCache (as Hixie mentioned in [3]). The fact that a fetch
failure occurred with a specific URL does not appear to be a new piece of
information, but presumably further details (e.g. the specific HTTP
response code, a more detailed message) should not be exposed?

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=22702
[2]
https://docs.google.com/document/d/1nlk7WgRD3d0ZcfK1xrwBFVZ3DI_e44j7QoMd5gAJC4E/edit?usp=sharing
[3]
https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/blSfs6IqcvY/jCfGdH3p8eAJ
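
For a sense of the shape, a page might consume the proposed event like this
(a sketch of the proposal in [2], not shipped behavior):

applicationCache.addEventListener("error", function (e) {
  // reason is an enum distinguishing e.g. resource fetch vs. manifest errors
  console.log(e.reason, e.url, e.status, e.message);
});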


Re: [whatwg] BinaryEncoding for Typed Arrays using window.btoa and window.atob

2013-08-13 Thread Joshua Bell
On Mon, Aug 12, 2013 at 4:50 PM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Aug 12, 2013 at 12:16 PM, Joshua Bell jsb...@google.com wrote:

 To recap history: early iterations of the Encoding API proposal did have
 base64 but it was removed with the suggestion to extend atob()/btoa()
 instead, and due to the confusion around the encode/decode verbs. If the
 APIs were something like StringToBytesConverter::convert() and
 BytesToStringConverter::convert() it would make more sense for encoding of
 both text (use StringToBytes) and binary data (use BytesToString).


 I thought about suggesting something like StringToBytes, but that seems
 less obvious for the (probably) more common usage of encoding/decoding a
 String, and it's still a bit off (though not *strictly* wrong) for
 converting to UTF-16, UTF-32, etc.  I tend to think the slightly
 unintuitive names of TextEncoder and TextDecoder aren't bad enough that
 it's worth renaming them.


For completeness, it's also worth bringing up
https://developer.mozilla.org/en-US/docs/Code_snippets/StringView - which
started this round of discussion (over on blink-dev) - as another, more
neutral API design for binary/string data interop. I haven't read it
deeply, but it looks like it doesn't handle the streaming case, though it
does explicitly tackle base64 without overloading text encoding methods.



  While we're re-opening this can of worms, there's been a request to add
 a flush() method to the TextEncoder/TextDecoder objects, which would behave
 the same as calling encode(null, {stream: false}) / decode(null,
 {stream:false}) but make the code more readable. This fails the "adding a
 new method for something that behaves exactly like something we already
 have" test. Opinions?


 I think you only need to say encode() and decode(), which is less of a
 win, especially since creating two ways of doing the same thing means that
 people have to learn both ways.  Otherwise, they'll see code end with
 .encode() and not realize that it's the same as the .finish() they've
 been using.


True. (I need to go back through this and other feedback that's trickled in
and see if I'm mis-representing it, and see if there's anything else
lingering.)



 On Mon, Aug 12, 2013 at 6:26 PM, Jonas Sicking jo...@sicking.cc wrote:

 I don't think that base64 encoding fits with the current
 TextEncoder/Decoder API. Not because of names, but because base64
 encoding is by nature opposite. I.e. the encoded format is in string
 form, whereas the decoded format is in binary form.


 The names are the only things that are opposite.  TextEncoder is just a
 streaming String-to-binary-blob conversion API, and TextDecoder is just a
 streaming binary-blob-to-String API, and that's precisely what base64
 encoding and decoding are.  That's the same whether you're converting
 String-to-base64 or String-to-UTF-8.  The only difference is that the names
 we've given to those ideas are reversed here.


Yes.



 One thing that might need special attention is that U+FFFD error handling
 doesn't make sense for base64; errors should probably always be fatal.


Excellent point.

...

I believe we may experiment with api-base64 and see if there are other
gotchas beyond this and the naming.




 --
 Glenn Maynard




Re: [whatwg] BinaryEncoding for Typed Arrays using window.btoa and window.atob

2013-08-12 Thread Joshua Bell
Back from a vacation, sorry about the late reply - hopefully still useful.

On Wed, Aug 7, 2013 at 3:02 PM, Glenn Maynard gl...@zewt.org wrote:

 On Wed, Aug 7, 2013 at 4:21 PM, Chang Shu csh...@gmail.com wrote:

  If we plan to enhance the Encoding spec, I personally prefer a new pair of
  BinaryDecoder/BinaryEncoder, which will be less confusing than reusing
  TextDecoder/TextEncoder.

 I disagree with the idea of adding a new method for something that behaves
 exactly like something we already have, just to give it a different name.

 (It may not be too late to rename those functions, if nobody has
 implemented them yet, but I'm not convinced it's much of a problem.)


FWIW, I've landed an experimental (behind a flag) implementation of the API
in Blink/Chromium; changing it is definitely possible for us. I believe Moz
is shipping it web-exposed already in FF?

To recap history: early iterations of the Encoding API proposal did have
base64 but it was removed with the suggestion to extend atob()/btoa()
instead, and due to the confusion around the encode/decode verbs. If the
APIs were something like StringToBytesConverter::convert() and
BytesToStringConverter::convert() it would make more sense for encoding of
both text (use StringToBytes) and binary data (use BytesToString).

While we're re-opening this can of worms, there's been a request to add a
flush() method to the TextEncoder/TextDecoder objects, which would behave
the same as calling encode(null, {stream: false}) / decode(null,
{stream:false}) but make the code more readable. This fails the "adding a
new method for something that behaves exactly like something we already
have" test. Opinions?


Re: [whatwg] Adding a btoa overload that takes Uint8Array

2013-03-04 Thread Joshua Bell
On Mon, Mar 4, 2013 at 9:09 AM, Boris Zbarsky bzbar...@mit.edu wrote:

 The problem I'm trying to solve is sending Unicode text to consumers who
 need base64-encoded input.  Right now the only "sane" way to do it (and I
 quote "sane" for obvious reasons) is something like the example at
 https://developer.mozilla.org/en-US/docs/DOM/window.btoa#Unicode_Strings

 It seems like it would be better if the output of a TextEncoder could be
 passed directly to btoa.  But for that we need an overload of btoa that
 takes a Uint8Array.


FYI, I believe the last iteration on this topic ended with this message:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-June/036372.html

i.e. consensus that base64 should stay out of the Encoding API, but that it
would be nice to have some form of base64 / Typed Array conversion API. But
there were no concrete proposals beyond my strawman in that post.

So: agreed, have at it!
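
For reference, the workaround available today is to widen each byte of the
encoder output into a code unit before calling btoa() - a sketch:

// bytes is a Uint8Array, e.g. TextEncoder output
function bytesToBase64(bytes) {
  var bin = "";
  for (var i = 0; i < bytes.length; i++)
    bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}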


Re: [whatwg] Encoding: API

2012-10-19 Thread Joshua Bell
On Thu, Oct 18, 2012 at 1:49 AM, Anne van Kesteren ann...@annevk.nl wrote:

 I added the API to the Encoding Standard:

   http://encoding.spec.whatwg.org/#api

 Feedback welcome. I suppose we might want to write an introduction for it
 too.


Thanks, Anne! Excellent cleanup, too.


On Thu, Oct 11, 2012 at 6:37 PM, Joshua Bell jsb...@chromium.org wrote:
  It sounds like there are several desirable behaviors:

  1. ignore BOM handling entirely (BOM would be present in output, or fatal)
  2. if matching BOM, consume; otherwise, ignore (mismatching BOM would be
  present in output, or fatal)
  3. switch encoding based on BOM (any of UTF-8, UTF-16LE, UTF-16BE)
  4. switch encoding based on BOM if-and-only-if UTF-16 explicitly
  specified, and only to one of the UTF-16 variants

 I went with supporting just 2 for now. 4 seems weird.


As per IRC discussion, if someone wants to implement this functionality it
is fairly simple from script.
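
For example, a sketch of behavior 4 in user code - sniff the BOM to pick the
byte order, then construct the decoder (assuming, per the above, that a
matching BOM is then consumed by the decoder itself):

function decodeUtf16(bytes) { // bytes: Uint8Array
  var label = "utf-16le"; // default when no BOM
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) label = "utf-16be";
  return new TextDecoder(label).decode(bytes);
}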


On Thu, Oct 18, 2012 at 11:24 PM, Anne van Kesteren ann...@annevk.nl wrote:

 On Thu, Oct 18, 2012 at 4:16 PM, Glenn Maynard gl...@zewt.org wrote:
  On Thu, Oct 18, 2012 at 3:54 AM, Anne van Kesteren ann...@annevk.nl
 wrote:
  * TextDecoder.decode()'s view argument is no longer optional. Why should
  it be?
 
  It buffers the EOF byte when in streaming mode, e.g. when the last byte
  of the stream is a UTF-8 continuation byte, so any decode errors are
  triggered.
 
  * TextEncoder.encode()'s input argument is no longer nullable. Again,
  why should it be?
 
  Likewise for encoding, to flush errors for trailing high surrogates.

 I made these arguments optional now (and named them both input). Note
 however that the way you get the EOF byte/EOF code point is by
 omitting the dictionary (whose stream member defaults to false), but I
 can see how not passing any arguments as a final call is convenient.


 https://github.com/whatwg/encoding/commit/39a201a5cdf43be3d49c6bac7952a0ecb225886b

 Yes, purely convenience. Otherwise you'd need to call:

decoder.decode(buffer1, {stream: true});
decoder.decode(buffer2, {stream: true});
decoder.decode(new Uint8Array());



  I also raised the issue of whether TextEncoder should really support
  utf-16/utf-16be as the encoding standard tries to deprecate non-utf-8
  encodings.
 
  The whole point of this API is to support legacy file formats that use
 other
  encodings.  (It's probably questionable to not support other encodings,
 too,
  eg. filenames in ZIP file headers, but starting out with Unicode is
 fine.)

 I thought it was mostly about reading legacy formats, but fair enough.


Jonas did a straw poll via Twitter about whether encoding to UTF-16 was
needed, and received positive feedback.


Re: [whatwg] Encoding: API

2012-10-10 Thread Joshua Bell
On Wed, Oct 10, 2012 at 6:42 AM, Anne van Kesteren ann...@annevk.nl wrote:

 Hey, I was wondering whether it would make sense to define
 http://wiki.whatwg.org/wiki/StringEncoding as part of
 http://encoding.spec.whatwg.org/ Tying them together makes sense to me
 anyway and is similar to what we do with URL, HTML, etc.


No objection from me.


 As for the open issue, I think it would make sense if the encoding's
 name was returned. Label is just some case-insensitive keyword to get
 there.


I tend to agree, as the label gives you no information you don't already
have, and the name can at least serve as a diagnostic.


 I also still think it's kinda yucky that this API has this gigantic
 hack around what the rest of the platform does with respect to the
 byte order mark. It seems really weird to not expose the same
 encode/decode that HTML/XML/CSS/etc. use.


IMHO the API needs to support two use cases: (1) code that wants to follow
the behavior of the web platform with respect to legacy content (i.e. the
desire to self-host), and (2) code that wants to parse files that are not
traditionally web data, i.e. fragments of binary files, which don't have
legacy behavior and where BOM taking priority would be surprising to
developers. For #2, following the behavior of APIs like ICU with respect to
BOMs is more sensible. I believe #2 is higher priority as long as it does
not preclude #1, and #1 can be achieved by code that inspects the stream
before handing it off to the decoder.

Practically speaking, this would mean refactoring the combined spec so that
the current BOM handling is defined for parsing web content outside of the
API rather than requiring the API to hack around it.

...

While we're here, any feedback from implementers? Mozilla is apparently
quite far along. Any surprises or additional issues? Any initial feedback
from users?

I received feedback recently that the API is perhaps too terse right now
when dealing with streaming content, and a more explicit decode(),
decodeStream(), resetStream() might be more intelligible. Thoughts?


Re: [whatwg] StringEncoding open issues

2012-09-17 Thread Joshua Bell
On Fri, Aug 17, 2012 at 5:19 PM, Jonas Sicking jo...@sicking.cc wrote:

 On Fri, Aug 17, 2012 at 7:15 AM, Glenn Maynard gl...@zewt.org wrote:
  On Fri, Aug 17, 2012 at 2:23 AM, Jonas Sicking jo...@sicking.cc wrote:
 
 - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
   0xFF then set current encoding to "utf-16" or "utf-16be" respectively
   and advance the stream past the BOM. The current encoding is used
   until the stream is reset.
 - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
   0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
   respectively and advance the stream past the BOM. The current encoding
   is used until the stream is reset.
 
  This doesn't sound right. The effect of the rules so far would be that
  if you create a decoder and specify "utf-16" as encoding, and the
  first bytes in the stream are 0xEF 0xBB 0xBF, you'd silently switch to
  utf-8 decoding.
 
  I think the scope of the "otherwise" is unclear, and this is meant to be
  "otherwise (if encoding is not utf-16)".

 Ah, that would make sense. It effectively means "if encoding is not set".

 / Jonas


I've attempted to distill the above into the spec in an algorithmic way:
http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder

English version: If you specify "utf-16" you get endian-agnostic UTF-16
encoding support. Failing that, if your encoding matches your BOM it is
consumed. Failing *that*, you get whatever behavior falls out of the decode
algorithm (garbage, error, etc).

The JS shim has *not* been updated yet.

Only part of this edit has been live for the last few weeks - apologies to
the Moz folks who were trying to understand what the half-specified
internal useBOM flag was for. Any implementer feedback so far?


Re: [whatwg] StringEncoding open issues

2012-09-17 Thread Joshua Bell
On Mon, Sep 17, 2012 at 2:17 PM, Anne van Kesteren ann...@annevk.nl wrote:

 On Mon, Sep 17, 2012 at 11:13 PM, Joshua Bell jsb...@chromium.org wrote:
  I've attempted to distill the above into the spec in an algorithmic way:
  http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder
 
  English version: If you specify "utf-16" you get endian-agnostic UTF-16
  encoding support. Failing that, if your encoding matches your BOM it is
  consumed. Failing *that*, you get whatever behavior falls out of the
  decode algorithm (garbage, error, etc).

 Why would we want the API to work differently from how it works in
 markup (with meta charset etc.)? Granted it's not super logical, but
 I don't really see why we should make it inconsistent and more
 complicated.


That's how the spec started out, so a recap of this thread would give you
the back-and-forth that led here. To summarize:

Having the BOM in the content be higher priority than the encoding selected
by the developer was not seen as desirable (see earlier in the thread), and
potentially a source of errors. Selecting encoding via BOM (in general, or
to emulate meta charset, etc) was seen as something that could be done in
user code if desired, but unexpected otherwise.

Two desired behaviors remained: (1) developer need for BOM-specified,
endian-agnostic UTF-16 encoding similar to ICU's handling that
distinguishes "utf-16" from "utf-16le", and (2) that matching BOMs should
be consumed and not appear in the decoded data.


Re: [whatwg] StringEncoding open issues

2012-08-16 Thread Joshua Bell
On Wed, Aug 15, 2012 at 5:30 PM, Glenn Maynard gl...@zewt.org wrote:

 On Tue, Aug 14, 2012 at 12:34 PM, Joshua Bell jsb...@chromium.org wrote:

- Create a decoder with TextDecoder() and if present a BOM will be
respected (and consumed), otherwise default to UTF-8


 Let's not default to "autodetect Unicode formats".  It encourages people
 to support UTF-16 when they may not mean to.  If BOM detection for both
 UTF-8 and UTF-16 is wanted, I'd suggest something explicit, like "utf-*".

 If the argument to the ctor is optional, I think the default should be
 purely UTF-8.


Works for me. In the algorithm specified in the email, this simply removes
the clause "If encoding is not specified, set an internal useBOM flag" -
namely, only "utf-16" gets the useBOM flag.

I'll attempt to wedge this into the spec soon.



  This gets easier if we restrict to encoding UTF-8 which typically doesn't
 include BOMs. But it's looking like there's enough desire to keep UTF-16
 encoding at the moment. Agree with just stripping it for now.


 UTF-8 sometimes does have a BOM, especially in Windows where applications
 sometimes use it to distinguish UTF-8 from ACP text files (which are just
 as common as ever--Windows has made no motion away from legacy encodings
 whatsoever).


Good point. Ah, Notepad, my old friend...


 Stripping the BOM can cause those applications to misinterpret the files
 as ACP.

 Anyway, even if the encoding API gives a helper for this, figuring out
 how that works would probably be more effort for developers than just
 peeking at the ArrayBuffer for the BOM and adding it back in manually.
 (I'm pretty sure anybody who knows enough to pay attention to this in the
 first place will have no trouble doing that.)  So, yeah, let's not worry
 about this.

 --
 Glenn Maynard




Re: [whatwg] StringEncoding open issues

2012-08-14 Thread Joshua Bell
On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard gl...@zewt.org wrote:

 I agree with Jonas that encoding should just use a replacement character
 (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
 other modes (eg. exceptions and user-specified replacement characters)
 until there's a clear need.

 My intuition is that encoding DOMString to UTF-16 should never have errors;
 if there are dangling surrogates, pass them through unchanged.  There's no
 point in using a placeholder that says "an error occurred here", when the
 error can be passed through in exactly the same form (not possible with e.g.
 DOMString->SJIS).  I don't feel strongly about this only because outputting
 UTF-16 is so rare to begin with.

 On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell jsb...@chromium.org wrote:

  - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
  the byte order mark (the encoding-specific serialization of U+FEFF).


 This rarely detects the wrong type, but that doesn't mean it's not the
 wrong answer.  If my input is meant to be UTF-8, and someone hands me
 BOM-marked UTF-16, I want it to fail in the same way it would if someone
 passed in SJIS.  I don't want it silently translated.

 On the other hand, it probably does make sense for UTF-16 to switch to
 UTF-16BE, since that's by definition the original purpose of the BOM.

 The convention iconv uses, which I think is a useful one, is: decoding from
 "UTF-16" means try to figure out the encoding from the BOM, if any, and
 "UTF-16LE" and "UTF-16BE" mean always use this exact encoding.


Let me take a crack at making this into an algorithm:

In the TextDecoder constructor:

   - If encoding is not specified, set an internal useBOM flag
   - If encoding is specified and is a case-insensitive match for "utf-16",
   set an internal useBOM flag.

NOTE: This means if "utf-8", "utf-16le" or "utf-16be" is explicitly
specified the flag is not set.

When decode() is called

   - If useBOM is set and the stream offset is 0, then
      - If there are not enough bytes to test for a BOM then return without
      emitting anything (NOTE: if not streaming, an EOF byte would be present
      in the stream, which would be a negative match for a BOM)
      - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
      0xFF then set current encoding to "utf-16" or "utf-16be" respectively
      and advance the stream past the BOM. The current encoding is used until
      the stream is reset.
      - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
      0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
      respectively and advance the stream past the BOM. The current encoding
      is used until the stream is reset.
   - Otherwise, if useBOM is not set and the stream offset is 0, then if the
   encoding is "utf-8", "utf-16" or "utf-16be":
      - If the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF
      then let detected encoding be "utf-16", "utf-16be" or "utf-8"
      respectively. If the detected encoding matches the object's encoding,
      advance the stream past the BOM. Otherwise, if the fatal flag is set
      then throw an EncodingError DOMException. Otherwise, the decoding
      algorithm proceeds.
      - If there are not enough bytes to test for a BOM then return without
      emitting anything (NOTE: if not streaming, an EOF byte would be
      inserted, which would be a negative match for a BOM)

Working the current encoding switcheroo into the spec will require some
refactoring, so trying to get consensus here first.

In English:

   - Create a decoder with TextDecoder() and if present a BOM will be
   respected (and consumed), otherwise default to UTF-8
   - Create a decoder with TextDecoder("utf-16") and either a UTF-16LE or
   UTF-16BE BOM will be respected (and consumed), otherwise default to
   UTF-16LE (which may decode garbage if a UTF-8 BOM or other non-UTF-16
   data is present)
   - Create a decoder with TextDecoder("utf-8", {fatal:true}),
   TextDecoder("utf-16le", {fatal:true}) or TextDecoder("utf-16be",
   {fatal:true}) and a matching BOM will be consumed;
   a mismatching BOM will throw an EncodingError
   - Create a decoder with TextDecoder("utf-8"), TextDecoder("utf-16le") or
   TextDecoder("utf-16be") and a matching BOM will be consumed; a mismatching
   BOM will be blithely decoded (probably giving you replacement characters),
   but not throwing.
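
Or, as (hypothetical) code per the proposed behavior above, where
bom16le/bom16be stand for byte sequences beginning with the corresponding
BOM:

TextDecoder().decode(bom16le);           // BOM respected and consumed
TextDecoder("utf-16").decode(bom16be);   // switches to big-endian
TextDecoder("utf-8", {fatal: true}).decode(bom16le); // throws EncodingError
TextDecoder("utf-8").decode(bom16le);    // blithely decoded, not throwing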

 * If one of the UTF encodings is specified AND the BOM matches then the
  leading BOM character (U+FEFF) MUST NOT be emitted in the output
  character sequence (i.e. it is silently consumed)
 

 It's a little weird that

 data = readFile("user-supplied-file.txt"); // shortcutting for brevity
 var s = new TextDecoder("utf-16").decode(data); // or "utf-8"
 s = s.replace("a", "b");
 var data2 = new TextEncoder("utf-16").encode(s);
 writeFile("user-supplied-file.txt", data2);

 causes the BOM to be quietly stripped away.  Normally if you're modifying a
 file, you want to pass through the BOM (or lack

Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-13 Thread Joshua Bell
Sorry if this is a dupe; I replied to this from my phone and an incorrect
address, and my earlier reply isn't showing in the archives.

On Fri, Aug 10, 2012 at 9:16 PM, Jonas Sicking jo...@sicking.cc wrote:

 The spec now contains the following text:

 NOTE: Because only UTF encodings are supported, and because of the
 algorithm used to convert a DOMString to a sequence of Unicode
 characters, no input can cause the encoding process to emit an encoder
 error.

 This is not correct. A DOMString is not a sequence of Unicode
 characters, it's a UTF-16 encoded string (this is per ECMAScript). Thus
 it can contain unpaired surrogates and so the encoding process can
 result in encoder errors.

 As I've suggested earlier, I think we should deal with this by simply
 emitting Unicode replacement characters for these encoder errors (i.e.
 for unpaired surrogates).


Already accounted for. Note the phrase:

and because of the algorithm used to convert a DOMString to a sequence of
 Unicode characters


This refers to the normative text that generates a sequence of Unicode code
points from a DOMString by reference to the algorithm in WebIDL [1], which
handles unpaired surrogates etc.

This informative text should say "Unicode code points" rather than "Unicode
characters", though. Fixing now and referenced [1] even in the note.

[1] http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
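
Concretely, the net effect (a sketch, per the algorithm in [1]):

// a lone high surrogate becomes U+FFFD before the encoder runs,
// so no input can produce an encoder error:
var bytes = new TextEncoder("utf-8").encode("\uD800");
// bytes is [0xEF, 0xBF, 0xBD] - the UTF-8 serialization of U+FFFD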


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-09 Thread Joshua Bell
On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote:



 On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote:

 On 08/07/2012 07:51 PM, Jonas Sicking wrote:

  I don't mind supporting *decoding* from basically any encoding that
 Anne's spec enumerates. I don't see a downside with that since I
 suspect most implementations will just call into a generic decoding
 backend anyway, and so supporting the same set of encodings as for
 other parts of the platform should be relatively easy.


 [...]


  However I think we should consider restricting support to a smaller
 set of encodings while *encoding*. There should be little reason
 for people today to produce text in non-utf formats. We might even be
 able to get away with only supporting UTF8, though I wouldn't be
 surprised if there are reasonably modern file formats which use utf16.


 FWIW, I agree with the decode-from-all-platform-encodings
 encode-to-utf[8|16] position.


 Any disagreement on limiting the supported encodings to utf-8, utf-16, and
 utf-16be, while permitting decoding of all encodings in the Encoding spec?

 (This eliminates the "what to do on encoding error" issue nicely, still
 need to resolve the BOM issue though.)


http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

I'm tempted to take it further to just UTF-8 and see if anyone complains.

Jury is still out on the decode-with-BOM issue - I need to reason through
Glenn's suggestions on the open issues thread.

I added a related open issue raised by Glenn, summarized as "... suggest
that the .encoding attribute simply return the name that was passed to
the constructor." Taking this further, perhaps the attribute should be
eliminated, as callers could apply it themselves.


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-08 Thread Joshua Bell
On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote:

 On 08/07/2012 07:51 PM, Jonas Sicking wrote:

  I don't mind supporting *decoding* from basically any encoding that
 Anne's spec enumerates. I don't see a downside with that since I
 suspect most implementations will just call into a generic decoding
 backend anyway, and so supporting the same set of encodings as for
 other parts of the platform should be relatively easy.


 [...]


  However I think we should consider restricting support to a smaller
 set of encodings while *encoding*. There should be little reason
 for people today to produce text in non-utf formats. We might even be
 able to get away with only supporting UTF8, though I wouldn't be
 surprised if there are reasonably modern file formats which use utf16.


 FWIW, I agree with the decode-from-all-platform-encodings
 encode-to-utf[8|16] position.


Any disagreement on limiting the supported encodings to utf-8, utf-16, and
utf-16be, while permitting decoding of all encodings in the Encoding spec?

(This eliminates the "what to do on encoding error" issue nicely, still
need to resolve the BOM issue though.)


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Joshua Bell
On Tue, Aug 7, 2012 at 8:32 AM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Aug 6, 2012 at 11:39 PM, Jonas Sicking jo...@sicking.cc wrote:

  I seem to have a recollection that we discussed only allowing encoding
  to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats
  as well as stay in sync with other APIs like XMLHttpRequest.
 


It looks like the relevant discussion was at
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html

It doesn't appear we reached consensus - there was some desire expressed to
scope to UTF-8, then perhaps expand to include UTF-16, definite consensus
that any encoding supported should be handled by both encode and decode,
then comments about XHR and form data encodings, but then the discussion
wandered into stateful vs. stateless encodings which took us off topic. So
Glenn's comment below pretty much reboots the conversation where it was:


 Not an objection, but where does XHR limit sent data to those encodings?
 send(FormData) forces UTF-8 (which is even more restrictive);
 send(Document) seems to allow any encoding *except* for UTF-16 (presumably
 web compat, since that's a weird criterion).

 I'm not sure that staying in sync with XHR--which has its own pile of
 legacy code to support--is worthwhile here anyway, but limiting to Unicode
 seems fine in its own right, especially since the restriction can always be
 lifted later if real needs come up.

 However I currently can't find any restrictions on which target
  encodings are supported in the current drafts.


When Anne's spec appeared I gutted mine and deferred wherever possible to
his. One consequence of that was getting the other encodings for free as
far as the spec writing goes.

If we achieve consensus that we only want to support UTF encodings we can
add the restrictions. There are use cases for supporting other encodings
(parsing legacy data file formats, for example), but that could be deferred.
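
(e.g. with decoding left unrestricted, something like this sketch remains
possible - the variable name is hypothetical:

var name = new TextDecoder("shift_jis").decode(zipEntryNameBytes);

while encoding would be limited to the UTF family.)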


  One wrinkle in this is if we want to support arbitrary encodings when
  encoding, that means that we can't use "insert a replacement
  character" as default error handling, since that isn't available in a
  lot of encoding formats.
 

 I don't think this part is a real hurdle.  Just replace with "?" for
 non-Unicode encodings.



On Tue, Aug 7, 2012 at 8:10 AM, Joshua Cranmer pidgeo...@verizon.net wrote:

  I found that the wiki version of the proposal cites 
  http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to
  find encodings.
 

 That spec documents the encodings which are used anywhere in the platform,
 but that doesn't necessarily mean every API needs to support all those
 encodings.  It's almost all backwards-compatibility.


There are also cross-browser differences in handling decoding of certain
code points in certain encodings. Exposing those encodings in a new API
would either require that the browser vendors expose those differences
(bleah) or implement a compatibility switch in the affected codecs (bleah).


Re: [whatwg] StringEncoding: encode() return type looks weird in the IDL

2012-08-06 Thread Joshua Bell
On Sun, Aug 5, 2012 at 11:44 AM, Boris Zbarsky bzbar...@mit.edu wrote:

 On 8/5/12 1:39 PM, Glenn Maynard wrote:

 I didn't say it was extensibility, just a leftover from something that
 was either considered and dropped or forgotten about.


 Oh, I see.  I thought you were talking about leaving the return value
 as-is so that Uint16Array return values can be added later.

 I'd vote for changing the return type to Uint8Array as things stand, and
 if we ever change what the function can return, we change the return type
 at that point.


Thanks. Yes, having the return type be ArrayBufferView in the IDL is just a
leftover. Fixing it now to be Uint8Array.

I'll start another thread on StringEncoding shortly summarizing open
issues, but anyone reading this thread is encouraged to take a look at
http://wiki.whatwg.org/wiki/StringEncoding and craft opinions.


[whatwg] StringEncoding open issues

2012-08-06 Thread Joshua Bell
Regarding the API proposal at: http://wiki.whatwg.org/wiki/StringEncoding

It looks like we've got some developer interest in implementing this, and
need to nail down the open issues. I encourage folks to look over the
"Resolved" issues in the wiki page and make sure the resolutions - gathered
from loose consensus here and offline discussion - are truly resolved, or if
anything is not future-proof and should block implementations from
proceeding. Also, look at the "Notes to Implementers" section; this should
be non-controversial but may be non-obvious.

This leaves two open issues: behavior on encoding error, and handling of
Byte Order Marks (BOMs)

== Encoding Errors ==

The proposal builds on Anne's
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
which defines when encodings should emit an encoder error. In that spec
(which describes the existing behavior of Web browsers) encoders are used
in a limited fashion, e.g. for encoding form results before submission via
HTTP, and hence the cases are much more restricted than the errors
encountered when browsers are asked to decode content from the wild. As
noted, the encoding process could terminate when an error is emitted.
Alternately (and as is necessary for forms, etc) there is a
use-case-specific escaping mechanism for non-encodable code points.

The proposed TextDecoder object takes a TextDecoderOptions options with a
|fatal| flag that controls the decode behavior in case of error - if
|fatal| is unset (default) a decode error produces a fallback character
(U+FFFD); if |fatal| is set then a DOMException is raised instead.

No such option is currently proposed for the TextEncoder object; the
proposal dictates that a DOMException is thrown if the encoder emits an
error. I believe this is sufficient for V1, but want feedback. For V2 (or
now, if desired), the API could be extended to accept an options object
allowing for some/all of these cases;

* Don't throw, instead emit a standard/encoding-specific replacement
character (e.g. '?')
* Don't throw, instead emit a fixed placeholder character (byte?) sequence
* Don't throw, instead call a user-defined callback and allow it to produce
a replacement escaped character sequence, e.g. &#x...;

The latter seems the most flexible (superset of the rest) but is probably
overkill for now. Since it can be added in easily later, can we defer until
we have implementer and user feedback?


== Byte Order Marks (BOMs) ==

Once again, the proposal builds on Anne's
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
which describes the existing behavior of Web browsers. In the wild,
browsers deal with a variety of mechanisms for indicating the encoding of
documents (server headers, meta tags, XML preludes, etc), many of which are
blatantly incorrect or contradictory. One form is fortunately rarely wrong
- if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
the byte order mark (the encoding-specific serialization of U+FEFF). This
is built into the Encoding spec - given a byte sequence to decode and an
encoding label, the label is ignored if the sequence starts with one of the
three UTF BOMs, and the BOM-indicated encoding is used to decode the rest
of the stream.

The proposed API will have different uses, so it is unclear that this is
necessary or desirable.

At a minimum, it is clear that:

* If one of the UTF encodings is specified AND the BOM matches then the
leading BOM character (U+FEFF) MUST NOT be emitted in the output character
sequence (i.e. it is silently consumed)

Less clear is the behavior in these two cases.

* If one of the UTF encodings is specified AND and a different BOM is
present (e.g. UTF-16LE but a UTF-16BE BOM)
* If one of the non-UTF encodings is specified AND a UTF BOM is present

Options include:
* Nothing special - decoder does what it will with the bytes, possibly
emitting garbage, possibly throwing
* Raise a DOMException
* Switch the decoder from the user-specified encoding to the BOM-specified
encoding

The latter seems the most helpful when the proposed API is used as follows:

var s = TextDecoder().decode(bytes); // handles UTF-8 w/o BOM and any UTF
w/ BOM

... but it does seem a little weird when used like this;

var d = TextDecoder('euc-jp');
assert(d.encoding === 'euc-jp');
var s = d.decode(new Uint8Array([0xFE]), {stream: true});
assert(d.encoding === 'euc-jp');
assert(s.length === 0); // can't emit anything until BOM is definitely passed
s += d.decode(new Uint8Array([0xFF]), {stream: true});
assert(d.encoding === 'utf-16be'); // really?


Re: [whatwg] StringEncoding: encode() return type looks weird in the IDL

2012-08-06 Thread Joshua Bell
On Sun, Aug 5, 2012 at 10:29 AM, Glenn Maynard gl...@zewt.org wrote:

 I guess the brokenness of Uint16Array (eg. the current lack of
 Uint16LEArray) could be sidestepped by just always returning Uint8Array,
 even if encoding to a 16-bit encoding (which is what it currently says to
 do).  Maybe that's better anyway, since it avoids making UTF-16 a special
 case.


+1 - which is why I pushed back on returning a Uint16Array earlier in the
discussion.


  I guess that if you're converting a string to a UTF-16 ArrayBuffer,
 you're probably doing it to quickly dump it into a binary field somewhere
 anyway--if you wanted to *examine* the codepoints, you'd just look at the
 DOMString you started with.


+1 again, and nicely stated. When I was a potential consumer of such an
API, I was happy to treat the encoded form as a black box.


Re: [whatwg] binary encoding

2012-06-12 Thread Joshua Bell
On Tue, Jun 12, 2012 at 2:29 AM, Simon Pieters sim...@opera.com wrote:

 On Mon, 11 Jun 2012 18:20:55 +0200, Joshua Bell jsb...@chromium.org
 wrote:

  
 http://wiki.whatwg.org/wiki/StringEncoding defines a binary encoding
 (basically the official iso-8859-1 where it is not mapped to
 windows-1252).



  which is residue from earlier iterations. Intended use case was
  interop with legacy JS that used the lower 8 bits of strings to hold
  binary data, e.g. with APIs like atob()/btoa().


 I think we should drop this and extend atob() and btoa() to be able to
 convert base64 strings to ArrayBuffer[View?] and back directly.


Agreed (I wanted a little more consensus before removing it).

Now that we can get binary data into script directly will there still be
active use of base64 + ArrayBuffers that will benefit from platform
support? Anyone want to tackle specifying the atob/btoa extensions? As a
strawman:

partial interface ArrayBufferView {
DOMString toBase64();
};

partial interface ArrayBuffer {
static ArrayBuffer fromBase64(DOMString string);
};

These don't handle data streaming scenarios, however.
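
In the meantime the conversion is straightforward, if allocation-heavy, in
script - a sketch:

function base64ToBytes(b64) {
  var bin = atob(b64);
  var bytes = new Uint8Array(bin.length);
  for (var i = 0; i < bin.length; i++)
    bytes[i] = bin.charCodeAt(i);
  return bytes;
}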

(This is completely orthogonal to Anne's question about whether a binary
encoding should be specified somewhere to describe current implementations.)


Re: [whatwg] binary encoding

2012-06-11 Thread Joshua Bell
On Mon, Jun 11, 2012 at 6:03 AM, Anne van Kesteren ann...@annevk.nl wrote:

 http://wiki.whatwg.org/wiki/StringEncoding


... hasn't been getting much attention from me recently. I'll recap the
open issues and proposed resolutions to this list soonish.


 defines a binary encoding
 (basically the official iso-8859-1 where it is not mapped to
 windows-1252).


 which is residue from earlier iterations. Intended use case was
interop with legacy JS that used the lower 8 bits of strings to hold binary
data, e.g. with APIs like atob()/btoa().


 Is it an idea to move that
 http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html somehow?


On its own, this use case is probably not strong enough to merit slipping a
pseudo-encoding into the platform, but...


 I
 do not think we want to give it an officially supported label, but it
 does make some sense to define it using the same infrastructure.
 http://dvcs.w3.org/hg/xhr/raw-file/tip/Overview.html has the same need
 for converting certain types of DOMString.


... as there are other use cases then we should codify it. I have no
preferences as to label; the proposed JS API could specify a label for it,
but defer the specifics of the encoding to the Encoding spec. (I believe as
written I currently call out the special case that BOM detection should
never be done for "binary" - which is already a special case - although BOM
detection vis-a-vis the API is itself an open issue)


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-04-04 Thread Joshua Bell
Any further input on Kenneth's suggestions?

Re: ArrayBufferView vs. DataView - I'm tempted to make the switch to just
DataView. As discussed below, data parsing/serialization operations will
tend to be associated with DataViews. As Glenn has mentioned elsewhere
recently, it is possible to accidentally do a buffer copy when mis-using
typed array constructors, while DataView avoids this. DataViews are cheap
to construct, and when I'm writing sample code for the proposed API I find
I create throw-away DataViews anyway. Also, there is the potential for
confusion when using a non-Uint8Array buffer, e.g. are the elements being
decoded using array[N] as the octets, or using the underlying buffer? For
Uint16Array/UTF-16 encodings, what are the endianness concerns? DataView
APIs have an explicit endianness and no index getter, which alleviates this
somewhat.

Re: writing into an existing buffer - as Glenn says, most of the input
earlier in the thread advocated strongly for a very simple initial API with
streaming support as the only fancy feature beyond the minimal string =
foo.decode(buffer) / buffer = foo.encode(string). Adding e.g.
foo.encodeInto(string, buffer) later on is not precluded if there is demand.

Also, I am planning to move the fatal option from the encode/decode
methods to the TextEncoder/TextDecoder constructors. Objections?

On Tue, Mar 27, 2012 at 7:43 PM, Kenneth Russell k...@google.com wrote:

 On Tue, Mar 27, 2012 at 6:44 PM, Glenn Maynard gl...@zewt.org wrote:
  On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote:
 
- I think it should reference DataView directly rather than
  ArrayBufferView. The typed array spec was specifically designed with
  two use cases in mind: in-memory assembly of data to be sent to the
  graphics card or audio device, where the byte order must be that of
  the host architecture;
 
 
  This is wrong, broken, won't be implemented this way by any production
  browser, isn't how it's used in practice, and needs to be fixed in the
  spec.  It violates the most basic web API requirement: interoperability.
  Please see earlier in the thread; the views affected by endianness need
 to
  be specced as little endian.  That's what everyone is going to implement,
  and what everyone's pages are going to depend on, so it's what the spec
  needs to say.  Separate types should be added for big-endian (eg.
  Int16BEArray).

 Thanks for your input.

 The design of the typed array classes was informed by requirements
 about how the OpenGL, and therefore WebGL, API work; and from prior
 experience with the design and implementation of Java's New I/O Buffer
 classes, which suffered from horrible performance pitfalls because of
 a design similar to that which you suggest.

 Production browsers already implement typed arrays with their current
 semantics. It is not possible to change them and have WebGL continue
 to function. I will go so far as to say that the semantics will not be
 changed.

 In the typed array specification, unlike Java's New I/O specification,
 the API was split between two use cases: in-memory data construction
 (for consumption by APIs like WebGL and Web Audio), and file and
 network I/O. The API was carefully designed to avoid roadblocks that
 would prevent maximum performance from being achieved for these use
 cases. Experience has shown that the moment an artificial performance
 barrier is imposed, it becomes impossible to build certain kinds of
 programs. I consider it unacceptable to prevent developers from
 achieving their goals.


  I also disagree that it should use DataView.  Views are used to access
  arrays (including strings) within larger data structures.  DataView is
 used
  to access packed data structures, where constructing a view for each
  variable in the struct is unwieldy.  It might be useful to have a helper
 in
  DataView, but the core API should work on views.

 This is one point of view. The true design goal of DataView is to
 supply the primitives for fast file and network input/output, where
 the endianness is explicitly specified in the file format. Converting
 strings to and from binary encodings is obviously an operation
 associated with transfer of data to or from files or the network.
 According to this taxonomy, the string encoding and decoding
 operations should only be associated with DataView, and not the other
 typed array types, which are designed for in-memory data assembly for
 consumption by other hardware on the system.


   - It would be preferable if the encoding API had a way to avoid
  memory allocation, for example to encode into a passed-in DataView.
 
 
  This was an earlier design, and discussion led to it being removed as a
  premature optimization, to simplify the API.  I'd recommend reading the
 rest
  of the thread.

 I do apologize for not being fully caught up on the thread, but hope
 that the input above was still useful.

 -Ken



Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Sat, Mar 24, 2012 at 6:52 AM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren ann...@opera.com
 wrote:

  Another way would be to have a second optional argument that indicates
  whether more bytes are coming (defaults to false), but I'm not sure of
 the
  chances that would be used correctly. The reasons you outline are
 probably
  why many browser implementations deal with EOF poorly too.


 It might not improve it, but I don't think it'd be worse.  If you didn't
 use it correctly for an encoding where it matters, the breakage would be
 obvious.

 Also, the previous automatically-streaming API has another possible
 misuse: constructing a single encoder, then calling it repeatedly for
 unrelated strings, without calling eof() between them (trailing bytes would
 become U+FFFD in the next string).  That'd be a less likely mistake with
 this, too.


Agreed. Simple things should be simple.


 Here's a suggestion, working from that:

 encoder = Encoder("euc-kr");
 view = encoder.encode(str1, {continues: true});
 view = encoder.encode(str2, {continues: true});
 view = encoder.encode(str3, {continues: false});

 An alternative way to end the stream:

 encoder = Encoder("euc-kr");
 view = encoder.encode(str1, {continues: true});
 view = encoder.encode(str2, {continues: true});
 view = encoder.encode(str3, {continues: true});
 view = encoder.encode("", {continues: false});
 // or view = encoder.encode(""); // equivalent; continues defaults to false
 // or view = encoder.encode(); // maybe equivalent, if the first parameter
 is optional

 The simplest usage is concise enough that we don't really need a separate
 str.encode() method:

 view = Encoder("euc-kr").encode(str);

 If it has an eof() method, it'd just be a literal wrapper for
 encoder.encode(), but it can probably be omitted.


Agreed, I'd omit it.

Bikeshed: The |continues| term doesn't completely thrill me; it's clear in
context, but not necessarily what someone might go searching for.
{eof:true} would be lovely except we want the default to be yes-EOF but a
falsy JS value. |noEOF| ?

If there aren't immediate objections, I'll update my wiki draft with this
style of API, and see about updating my JS polyfill as well.

Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?

One object type is simpler for the non-streaming case, e.g.:

// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere...
str = g_codec.decode(view); // okay
view = g_codec.encode(str); // fine, no state captured
str = g_codec.decode(view); // still okay

but IMHO someone unfamiliar with the internals of encodings might extend
the above into:

// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere in some stream handling code...
str = g_codec.decode(view, {continues: true}); // okay..
view = g_codec.encode(str, {continues: true}); // sure, now both an encode
and decode state are captured by codec
str = g_codec.decode(view, {continues: true}); // okay only if this is more
of the same stream; if there are two incoming streams, this is wrong

The same mistake is possible with Encoder / Decoder objects, of course (you
just need two globals). But something about separating them makes it
clearer to me that the |continues| flag is affecting state in the object
rather than just affecting the output of the call.
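
For instance, a sketch using the proposed names (chunk_a1 and chunk_b1
stand in for incoming data):

// One decoder object per incoming stream, so each keeps its own state:
var dec_a = new Decoder("euc-kr"), str_a = "";
var dec_b = new Decoder("euc-kr"), str_b = "";
str_a += dec_a.decode(chunk_a1, {continues: true}); // state lives in dec_a
str_b += dec_b.decode(chunk_b1, {continues: true}); // ... and in dec_b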


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Mon, Mar 26, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote:

 On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org
 wrote:

 Bikeshed: The |continues| term doesn't completely thrill me; it's clear
 in context, but not necessarily what someone might go searching for.
 {eof:true} would be lovely except we want the default to be yes-EOF but a
 falsy JS value. |noEOF| ?


 Peter Beverloo suggests "stream" on IRC. I like it.


+1


 Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?


 Two seems cleaner.


I've gone ahead and updated the wiki/draft:
http://wiki.whatwg.org/wiki/StringEncoding

This includes:

* TextEncoder / TextDecoder objects, with |encode| and |decode| methods
that take option dicts
* A |stream| option, per the above
* A |nullTerminator| option eliminates the need for a stringLength method
(hasta la vista, baby!)
* |encodedLength| method is dropped since you can't in-place encode anyway
* decoding errors yield fallback code points by default, but setting a
|fatal| option causes a DOMException to be thrown instead
* specified exceptions as DOMException of type EncodingError, as a
placeholder
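
For illustration, basic usage under this draft looks something like the
following (constructor and option names per the wiki draft, and still
subject to change; chunk1/chunk2 stand in for incoming ArrayBufferViews):

// Non-streaming round trip:
var bytes = new TextEncoder("utf-8").encode("caf\u00E9");
var str = new TextDecoder("utf-8").decode(bytes);

// Streaming decode, carrying state across chunks:
var decoder = new TextDecoder("utf-8");
var out = decoder.decode(chunk1, {stream: true});
out += decoder.decode(chunk2, {stream: true});
out += decoder.decode(); // flush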

New issues resulting from this refactor:

* You can change the options (stream, nullTerminator, fatal) midway through
decoding a stream. This would be silly to do, but as written I don't think
this makes the implementation more difficult. Alternately, the non-stream
options could be set on the TextDecoder object itself.

* BOM handling needs to be resolved. The Encoding spec makes the encoding
label secondary to the BOM. With this API it's unclear if that should be
the case. Options include having a mismatching BOM throw, treating a
mismatching BOM as a decoding error (i.e. fallback or throw, depending on
options), or allowing the BOM to actually switch the decoder used for this
stream - possibly if-and-only-if the default encoding was specified.

I've also partially updated the JS polyfill proof-of-concept
implementation, tests, and examples, but it does not implement streaming
yet (i.e. a stream option is ignored, and state is always lost); I need to
do a tiny bit more refactoring first.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Mon, Mar 26, 2012 at 4:12 PM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell jsb...@chromium.org wrote:

 * A |stream| option, per the above


 Does this make sense when you're using stream: false to flush the stream?
 It's still a streaming operation.  I guess it's close enough.

 * A |nullTerminator| option eliminates the need for a stringLength method
 (hasta la vista, baby!)


 I strongly disagree with this change.  It's much cleaner and more generic
 for the decoding algorithm to not know anything about null terminators, and
 to have separate general-purpose methods to determine the length of the
 string (memchr/wmemchr analogs, which we should have anyway).  We made this
 simplification a long time ago--why did you resurrect this?


Ah, I'd forgotten that there was consensus that doing this outside the API
was preferable. I'll remove the option when I touch the spec again.

* BOM handling needs to be resolved. The Encoding spec makes the encoding
 label secondary to the BOM. With this API it's unclear if that should be
 the case. Options include having a mismatching BOM throw, treating a
 mismatching BOM as a decoding error (i.e. fallback or throw, depending on
 options), or allow the BOM to actually switch the decoder used for this
 stream - possibly if-and-only-if the default encoding was specified.


 The path of fewest errors is probably to have a BOM override the specified
 UTF-16 endianness, so saying UTF-16BE just changes the default.


This would apply only if the previous call had {stream: false} (implicitly
or explicitly). Calling with {stream: false} would reset for the next call.

Would it apply only to UTF-16, or to UTF-8 as well? Should there be any
special behavior when not specifying an encoding in the constructor?
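
To make that last option concrete, a purely hypothetical sketch (this is
one of the options under discussion, not specified behavior):

var dec = new TextDecoder("utf-16be");
var bytes = new Uint8Array([0xFF, 0xFE, 0x68, 0x00]); // UTF-16LE BOM + "h"
dec.decode(bytes); // would yield "h" if the BOM may switch the decoder to LE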

On Mon, Mar 26, 2012 at 4:27 PM, Jonas Sicking jo...@sicking.cc wrote:

 A few comments:

 * It appears that we lost the ability to measure how long a resulting
 buffer was going to be and then decode into the buffer. I don't know
 if this is an issue.


True. On the plus side, the examples in the page (encode/decode
array-of-strings) didn't change size or IMHO readability at all.


 * It might be a performance problem to have to check for the
 fatal/nullTerminator options on each call.


No comment here. Moving the fatal and other options to the TextDecoder
object rather than the decode() call is a possibility. I'm not sure which I
prefer.


 * We lost the ability to decode from an ArrayBuffer and see how many
 bytes were consumed before a null-terminator was hit. One not terribly
 elegant solution would be to add a TextDecoder.decodeWithLength method
 which returns a DOMString+length tuple.


Agreed, but of course see above - there was consensus earlier in the thread
that searching for null terminators should be done outside the API, so the
caller will have the length handy already. Yes, this would be a big flaw:
decoding a tightly packed data structure (e.g. an array of null-terminated
strings without lengths) would be impossible with just the nullTerminator
flag.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Mon, Mar 26, 2012 at 6:24 PM, Glenn Maynard gl...@zewt.org wrote:

 I guess.  It doesn't seem that important, since it's just a few lines of
 code.  If this is done, I'd suggest that this helper API *not* have any
 special support for streaming (not to disallow it, but not to have any
 special handling for it, either).  I think streaming has little overlap
 with null-terminated fields, since null-termination is typically used with
 fixed-size buffers.  It would complicate things; for example, you'd need
 some way to signal to the caller that a null terminator was encountered.


Agreed.

Also worth relaying to this thread is that in addition to null termination
there have been requests for other terminators, such as 0xFF, which is an
invalid byte in a UTF-8 stream and thus a lovely terminator. Other byte
sequences were mentioned. (This was over in the Khronos WebGL list for
anyone who wants to dig it up. It was tracked as an unresolved ISSUE in the
spec.)

This supports the assertion that we should not special case null
terminators, but instead provide general (and highly optimizable) utilities
like memchr operating on buffers, since we can't anticipate every usage in
higher-level APIs like the one under discussion.
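
For example, the caller can do the memchr step itself and then decode just
the slice it wants; a sketch, assuming the draft's TextDecoder and a
typed-array indexOf as the memchr analog:

var bytes = new Uint8Array(buffer);
var end = bytes.indexOf(0);          // find the null terminator
if (end === -1) end = bytes.length;  // no terminator: take everything
var str = new TextDecoder("utf-8").decode(bytes.subarray(0, end));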


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Joshua Bell
On Wed, Mar 21, 2012 at 12:42 PM, Anne van Kesteren ann...@opera.com wrote:

 On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc
 wrote:

 This leaves us with 2 or 3. So the question is if we should support
 streaming or not. I suspect doing so would be worth it.


 For XMLHttpRequest it might be, yes.

 I think we should expose the same encoding set throughout the platform.
 One reason to limit the encoding set initially might be because we have not
 all converged yet on our encoding sets. Gecko, Safari, and Internet
 Explorer expose a lot more encodings than Opera and Chrome.


Just to throw it out there - does anyone feel we can/should offer
asymmetric encode/decode support, i.e. supporting more encodings for decode
operations than for encode operations?

As for the API, how about:

  enc = new Encoder("euc-kr")
  bytes1 = enc.encode(string1)
  bytes2 = enc.encode(string2)
  bytes3 = enc.eof() // might return an empty view if all is fine

 And similarly you would have

  dec = new Decoder("shift_jis")
  string = dec.decode(bytes)

 Or alternatively you could have a single object that exposes both encode()
 and decode() and tracks state for both:

  enc = new Encoding("gb18030")
  string1 = enc.decode(bytes1)
  bytes2  = enc.encode(string2)


That's the direction my thinking was headed. Glenn pointed out that the
state that's implicitly captured in the above objects could instead be
returned as an explicit but opaque state object that's passed in and out of
stateless functions. As a potential user of the API, I find the above
object-oriented style easier to understand.

Re: Encoding object vs. an Encoder/Decoder pair - I'd prefer the latter as
it makes the state being captured and any methods/attributes to interrogate
the state clearer.

Bikeshedding on the name - we'd have to put "String" or "Text" in there
somewhere, since audio/video/image codecs will likely want to use similar
terms.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-20 Thread Joshua Bell
On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote:

 Why are encodings different than other parts of the API where you

 indeed have to know what works and what doesn't.


 Do you memorize lists of encodings?  I certainly don't.  I look them up as
 needed.

 UTF8 is stateful, so I disagree.


 No, UTF-8 doesn't require a stateful decoder to support streaming.  You
 decode up to the last codepoint that you can decode completely.  The return
 values are the output data, the number of bytes output, and the number of
 bytes consumed; that's all you need to restart decoding later.  That's the
 iconv(3) approach that we're probably all familiar with, which works with
 almost all encodings.

 ISO-2022 encodings are stateful: you have to persistently remember the
 character subsets activated by earlier escape sequences.  An iconv-like
 streaming API is impossible; to support streamed decoding, you'd need to
 have a decoder object that the user keeps around in order to store that
 state.  http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure


Which seems like it leaves us with these options:

1. Only support encodings with stateless coding (possibly down to a minimum
of UTF-8)
2. Only provide an API supporting non-streaming coding (i.e. whole
strings/whole buffers)
3. Expand the API to return encoder/decoder objects that capture state

Any others?

Trying to simplify the problem by taking on both (1) and (2) without (3)
would lead to an API that could not grow to encompass (3) in the future,
which would be a mistake.

I'll throw out that the in-progress design of a Globalization API for
ECMAScript -
http://norbertlindenberg.com/2012/02/ecmascript-internationalization-api/ -
is currently spec'd both to build on the existing locale-aware methods on
the String/Number/Date prototypes as conveniences and to introduce the
Collator and *Format objects.

Should we start with UTF-8-only/non-streaming methods on
DOMString/ArrayBufferView, and avoid constraining a future API supporting
multiple, possibly stateful encodings and streaming?
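
For reference, the iconv-style approach Glenn describes is straightforward
to sketch for UTF-8: withhold any incomplete trailing sequence from a chunk
and carry those bytes into the next call. A simplified helper (assumes
mostly-valid input, leaving truly invalid bytes for the decoder's error
handling):

// Returns how many bytes at the start of the chunk form complete UTF-8
// sequences; the remainder should be prepended to the next chunk.
function completePrefixLength(bytes) {
  var i = bytes.length;
  var back = 0;
  // Step back over up to three trailing continuation bytes (10xxxxxx).
  while (back < 3 && i > 0 && (bytes[i - 1] & 0xC0) === 0x80) { i--; back++; }
  if (i === 0) return bytes.length; // only continuations; let decoder cope
  var lead = bytes[i - 1];
  var need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
  // Keep everything if the final sequence is complete (or invalid);
  // otherwise withhold the incomplete lead byte and its continuations.
  return (back + 1 >= need) ? bytes.length : i - 1;
}

A caller decodes the first completePrefixLength(chunk) bytes now and
prepends the rest to the next chunk, with no decoder state kept between
calls.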


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:

 What's the use-case for the stringLength function? You can't decode
 into an existing datastructure anyway, so you're ultimately forced to
 call decode at which point the stringLength function hasn't helped
 you.


 stringLength doesn't return the length of the decoded string.  It returns
 the byte offset of the first \0 (or the length of the whole buffer, if
 none), for decoding null-terminated strings.  For multibyte encodings (eg.
 everything except UTF-16 and friends), it's just memchr(), so it's much
 faster than actually decoding the string.


And just to be clear, the use case is decoding data formats where string
fields are variable-length and null-terminated.


 Currently the use-case of simply wanting to convert a string to a
 binary buffer is a bit cumbersome. You first have to call the
 encodedLength function, then allocate a buffer of the right size,
 then call the encode function.


 I suggested eg.

 result = encode(string, "utf-8", null).output;

 which would create an ArrayBuffer of the required size.  Presumably the
 null ArrayBufferView argument would be optional, so you could just say
 encode(string, "utf-8").


I think we want both encoding and destination to be optional. That leads us
to an API like:

out_dict = stringEncoding.encode(string, opt_dict);

... where both out_dict and opt_dict are WebIDL dictionaries:

opt_dict keys: view, encoding
out_dict keys: charactersWritten, bytesWritten, output

... where output === view if view is supplied, otherwise a new Uint8Array
(or Uint8ClampedArray??)

If this instead is attached to String, it would look like:

out_dict = my_string.encode(opt_dict);

If it were attached to ArrayBufferView, having a right-size buffer
allocated for the caller gets uglier unless we include a static version.
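
To make the shape concrete, a sketch (names as proposed above, purely
illustrative):

// Allocate-for-me form:
var r = stringEncoding.encode("caf\u00E9", {encoding: "utf-8"});
r.output            // a new Uint8Array: 63 61 66 C3 A9
r.bytesWritten      // 5
r.charactersWritten // 4

// Encode-into-my-view form:
var buf = new Uint8Array(1024);
var r2 = stringEncoding.encode("caf\u00E9", {encoding: "utf-8", view: buf});
r2.output === buf   // true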

It doesn't seem possible to implement the 'encode' function without
 doing multiple scans over the string. The implementation seems
 required both to check that the data can be decoded using the
 specified encoding, as well as check that the data will fit in the
 passed in buffer. Only then can the implementation start decoding the
 data. This seems problematic.


 Only if it guarantees that it doesn't write anything to the output buffer
 unless the entire result will fit.  I don't think we need to do that; just
 guarantee that it'll be truncated on a whole codepoint.


Agreed. Input/output dicts mean the API documentation a caller needs to
read to understand the usage is more complex than a function signature,
which is why I resisted them, but it does seem like the best approach.
Thanks for pushing, Glenn!

In the create-a-buffer-on-the-fly case there will be some memory juggling
going on, either by initially over-allocating or by reallocating/moving.


 I also don't think it's a good idea to throw an exception for encoding
 errors. Better to convert characters to the unicode replacement
 character. I believe we made a similar change to the WebSockets
 specification recently.


 Was that change made?  I filed
 https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
 to be undecided.


Settling on an options dict means that adding a flag to control this
behavior ({throws: true} ?) doesn't extend the API surface significantly.
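
E.g., sketching with the same hypothetical helper shape (the flag name is
illustrative only):

// Default: invalid bytes become U+FFFD replacement characters.
stringEncoding.decode(new Uint8Array([0xFF]), {encoding: "utf-8"});
// "\uFFFD"

// Opt-in strictness:
stringEncoding.decode(new Uint8Array([0xFF]),
                      {encoding: "utf-8", throws: true}); // throws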


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:


 And just to be clear, the use case is decoding data formats where string
 fields are variable-length and null-terminated.


... and the spec should include normative guidance that length-prefixing is
strongly recommended for new data formats.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Fri, Mar 16, 2012 at 10:35 AM, Glenn Maynard gl...@zewt.org wrote:

 On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:


 ... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)


 Uint8Array is correct.  (Uint8ClampedArray is for image color data.)

 If UTF-16 or UTF-32 are supported, decoding to them should return
 Uint16Array and Uint32Array, respectively (with the return value being
 typed just to ArrayBufferView).


FYI, there was some follow-up IRC conversation on this. With Typed Arrays
as currently specified - that is, with Uint16Array having platform
endianness - the above would imply either that platform endianness dictated
the output byte sequence (and le/be was ignored), or that
encode("\uFFFD", "utf-16").view[0] might != 0xFFFD on some platforms.

There was consensus (among the two of us) that the output view's underlying
buffer's byte order would be le/be depending on the selected encoding.
There is not consensus over what the return view type should be:
Uint8Array, or BE/LE variants of Uint16Array to conceal platform
endianness.
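
For reference, host byte order is observable from script in one line,
which is why the choice matters at all:

// true on little-endian hosts, false on big-endian ones:
var little = new Uint8Array(new Uint16Array([0x0102]).buffer)[0] === 0x02;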


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Joshua Bell
FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding

* Rewritten in terms of Anne's Encoding spec and WebIDL, for algorithms,
encodings, and encoding selection, which greatly simplifies the spec. This
implicitly adds support for all of the other encodings defined therein - we
may still want to dictate a subset of encodings. A few minor issues noted
throughout the spec.
* Define a "binary" encoding, since that support was already in this spec.
We may decide to kill this, but I didn't want to remove it just yet.
* Simplify methods to take ArrayBufferView instead of
"any"/byteOffset/byteLength. The implication is that you may need to use
temporary DataViews, and this is reflected in the examples.
* Call out more of the big open issues raised on this thread (e.g. where
should we hang this API)

Nothing controversial added, or (alas) resolved.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Joshua Bell
On Wed, Mar 14, 2012 at 3:53 PM, Glenn Maynard gl...@zewt.org wrote:


 It's more than a naming problem.  With this string API, one side of the
 conversion is always a DOMString.  Base64 conversion wants
 ArrayBuffer-to-ArrayBuffer conversions, so it would belong in a separate API.


Huh. The scenarios I've run across are Base64-encoded binary data islands
embedded in textual container formats like XML or JSON, which yield a
DOMString I want to decode into an ArrayBuffer.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Joshua Bell
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote:

 On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote:

  Something that has come up a couple of times with content authors
  lately has been the desire to convert an ArrayBuffer (or part thereof)
  into a decoded string. Similarly being able to encode a string into an
  ArrayBuffer (or part thereof).
 

 There was discussion about this before:


 https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html
 http://wiki.whatwg.org/wiki/StringEncoding

 (I don't know why it was on the WebGL list; typed arrays are becoming
 infrastructural and this doesn't seem like it belongs there, even though
 ArrayBuffer was started there.)


Purely historical; early adopters of Typed Arrays were folks prototyping
with WebGL who wanted to parse data files containing strings.

WHATWG makes sense, I just hadn't gotten around to shopping for a home.
(Administrivia: Is there need to propose a charter addition?)


 The API on that wiki page is a reasonable start.  For the same reasons that
 we discussed in a recent thread (
 http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
 conversion errors should use replacement (eg. U+FFFD), not throw
 exceptions.  The "any" arguments should be fixed.  Encoding to UTF-16
 should definitely not prefix a BOM, and UTF-16 having unspecified
 endianness is obviously bad.

 I'd also suggest that, unless there's serious, substantiated demand for
 it--which I doubt--only major Unicode encodings be supported.  Don't make
 it easier for people to keep using legacy encodings.


Two other pieces of feedback I received from Adam Barth off list:

* take ArrayBufferView as input, which both fixes the "any" arguments and
simplifies the API by eliminating byteOffset and byteLength
* support two versions of encode, one which takes a target ArrayBufferView,
and one which allocates/returns a new Uint8Array of the appropriate length.



  Shouldn't this just be another ArrayBufferView type with special
  semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
  getString()/setString() method pair on DataView?

 I don't think so, because retrieving the N'th decoded/reencoded character
 isn't a constant-time operation.

 --
 Glenn Maynard



Re: [whatwg] Behavior when script is removed from DOM

2011-12-07 Thread Joshua Bell
On Wed, Dec 7, 2011 at 12:01 PM, Jonas Sicking jo...@sicking.cc wrote:

 On Wed, Dec 7, 2011 at 11:27 AM, Adam van den Hoven a...@littlefyr.com
 wrote:
  On Sat, Dec 3, 2011 at 9:17 PM, Jonas Sicking jo...@sicking.cc wrote:
 
  On Sat, Dec 3, 2011 at 7:38 PM, Yehuda Katz wyc...@gmail.com wrote:
  
  
  
   On Sat, Dec 3, 2011 at 6:37 PM, Jonas Sicking jo...@sicking.cc
 wrote:
  
   On Sat, Dec 3, 2011 at 6:24 PM, Yehuda Katz wyc...@gmail.com
 wrote:
   
   
   
On Fri, Dec 2, 2011 at 11:30 AM, Tab Atkins Jr.
jackalm...@gmail.com
wrote:
   
On Fri, Dec 2, 2011 at 11:27 AM, Jonas Sicking jo...@sicking.cc
wrote:
 The main use case for wanting to support scripts getting removed appears
 to be wanting to abort JSONP loads. Potentially to issue it with new
 parameters. This is a decent use case, but given the racyness described
 above in webkit, it doesn't seem like a reliable technique in existing
 browsers.
   
If it's unreliable *and* no sites appear to break with the proper
behavior, we shouldn't care about this use-case, since
 cross-domain
XHR solves it properly.
   
   
 Cross-domain XHR *can* solve this use case, but the fact is that CORS
 is harder to implement than JSONP, and so we continue to have a large
 number of web APIs that support JSONP but not CORS. Unfortunately, I
 do not foresee this changing in the near future.
  
   I think we can solve this in 3 ways:
  
   1. Keep spec as it is. Pages can simply ignore the JSONP callback
 when
   it happens.
   Disadvantages:
   Additional bandwidth.
   More complexity for the web page.
  
   2. Make removing scripts cancel any execution
   Disadvantages:
   Pages will have to deal with the fact that removing scripts can still
   cause the callback to happen if the load just finished. So the same
   amount of complexity for page authors that don't want buggy pages as
   alternative 1.
    Since many pages likely won't properly handle the callback happening
    anyway, this will likely cause pages to be buggy in contemporary browsers.
  
   3. Add a new API to reliably cancel a script load
   Disadvantages:
   New API for pages to learn.
  
  
   4. Add a new API (or customize XHR) to explicitly support JSONP
   requests,
   and allow those requests to be cancelled.
 
  Yes, that's definitely an option.
 
  It will be sort of a weird API since the security model will be sort
  of strange. Traditionally we say that you can't load data cross site,
  but that you can execute scripts cross site. Here we want something
  sort of in between.
 
  It could have significant advantages if it makes it easier for sites
  to do cross-site loading of data without exposing themselves to XSS
  risks.
 
  / Jonas
 
 
  If we went for a hybrid approach, namely that XHR has a cancellable way
 to
  call and execute some arbitrary JavaScript and sandbox the execution so
 that
  this is something explicitly provided to the XHR, would we not suddenly
  have a rather secure way to load any javascript in general (and probably
  make things like lab.js and yepnope easier to write)? Now I can load some
  javascript (say from some ad server) without giving it access to the
 window
  object and the global scope, if I don't want to. Wouldn't this address
 some
  of the security issues that Doug Crockford has brought up in the past?

  Yeah. This would be very cool. Proposals more than welcome, though I
  would suggest not tying it to XHR but rather have a dedicated "load
  and execute this url in this sandbox" API.

 Designing a sandbox API is likely a fairly large task. I believe that
 ES.next might have something to that extent but I'm not fully sure.


Yeah, the modules proposal for ES harmony is fairly similar:
http://wiki.ecmascript.org/doku.php?id=harmony:modules

The relevant bits for this thread are that a script can be loaded into a
new pristine global environment (i.e. it doesn't just get to party on
window, and is shielded from any prior monkeying with Object.prototype,
etc.) and decides what to export (by applying properties to its global
object); the script doing the import can decide what to pick up from the
global object of the imported module.

This can't be implemented in JS today (e.g. as a shim) since that "evaluate
this script text in this new global sandbox" bit isn't present.


 A dedicated JSONP API is likely a lot simpler to design and could be
 specced and rolled out quicker. But of course it has a smaller feature
 set.

 / Jonas




Re: [whatwg] Specs for window.atob() and window.btoa()

2011-02-05 Thread Joshua Bell
On Sat, Feb 5, 2011 at 6:37 PM, Joshua Cranmer pidgeo...@verizon.net wrote:

 On 02/05/2011 08:29 PM, Jonas Sicking wrote:

 So my first question is, can someone give examples of sources of
 base64 data which contains whitespace?

 The best guess I have is base64-encoding MIME parts, which would be
 hard-wrapped every 70-80 characters or so.


RFC 3548, "The Base16, Base32, and Base64 Data Encodings", Section 2.1
discusses line feeds in encoded data, calling out the MIME line-length
limit. For example, Perl's MIME::Base64 has an encode_base64() API that by
default inserts newlines after 76 characters. (An optional argument allows
this behavior to be overridden.)

Section 2.3 discusses "Interpretation of non-alphabet characters in encoded
data", specifically in base64 (etc.) encoded data.

-- Josh


Re: [whatwg] Specs for window.atob() and window.btoa()

2011-01-07 Thread Joshua Bell
On Fri, Jan 7, 2011 at 9:27 AM, Aryeh Gregor simetrical+...@gmail.com wrote:

 On Fri, Jan 7, 2011 at 12:01 AM, Boris Zbarsky bzbar...@mit.edu wrote:


  Note that it's not that uncommon to use atob on things that came from
 other
  base64-producing tools, not just from btoa.  Not sure whether that
 matters
  here.

 I don't think it does.  I don't think any base64 encoding
 implementation is likely to pad input strings' lengths to a multiple
 of six bits using anything other than zero bits.  So it's mostly just
 a matter of specification and testing simplicity.


It might not hurt to include an *informative* note in the specification that
some base64-encoding tools and APIs by default inject whitespace into any
base64-encoded data they output; for example, line breaks after 76
characters. Therefore, defensively written programs that use window.atob
should consider the use of something akin to:

var output = window.atob(input.replace(/\s+/g, ""));

Again, this would be informative only; rejection of input strings containing
whitespace is already implicitly covered by your normative text.