Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-14 Thread Simon Pieters
On Thu, 09 Aug 2012 19:42:07 +0200, Joshua Bell jsb...@chromium.org wrote:

 http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
 supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

 I'm tempted to take it further to just UTF-8 and see if anyone complains.


I was going to suggest doing so. We've gone UTF-8-only for new features
(workers, webvtt, appcache manifest, etc). The Encoding spec says "New
content and formats must exclusively use the utf-8 encoding." Is there a
use case for utf-16/utf-16be?


--
Simon Pieters
Opera Software


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-14 Thread Jonas Sicking
I think the main reason would be if there are modern formats which use
UTF16 which we want to allow people to create documents in. I asked on
twitter for such formats and got some responses:

https://twitter.com/SickingJ/status/234060964058763264

/ Jonas

On Tue, Aug 14, 2012 at 7:42 AM, Simon Pieters sim...@opera.com wrote:
 On Thu, 09 Aug 2012 19:42:07 +0200, Joshua Bell jsb...@chromium.org wrote:

 http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
 supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

 I'm tempted to take it further to just UTF-8 and see if anyone complains.


 I was going to suggest doing so. We've gone UTF-8-only for new features
 (workers, webvtt, appcache manifest, etc). The Encoding spec says "New
 content and formats must exclusively use the utf-8 encoding." Is there a
 use case for utf-16/utf-16be?

 --
 Simon Pieters
 Opera Software


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-14 Thread Glenn Maynard
On Tue, Aug 14, 2012 at 9:42 AM, Simon Pieters sim...@opera.com wrote:

 On Thu, 09 Aug 2012 19:42:07 +0200, Joshua Bell jsb...@chromium.org wrote:

 http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
 supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

 I'm tempted to take it further to just UTF-8 and see if anyone complains.


 I was going to suggest doing so. We've gone UTF-8-only for new features
 (workers, webvtt, appcache manifest, etc). The Encoding spec says "New
 content and formats must exclusively use the utf-8 encoding." Is there a
 use case for utf-16/utf-16be?


Specs can't (meaningfully) place normative requirements on all new content
and formats.  This should be a note.

-- 
Glenn Maynard


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-13 Thread Joshua Bell
Sorry if this is a dupe; I replied to this from my phone and an incorrect
address, and my earlier reply isn't showing in the archives.

On Fri, Aug 10, 2012 at 9:16 PM, Jonas Sicking jo...@sicking.cc wrote:

 The spec now contains the following text:

 NOTE: Because only UTF encodings are supported, and because of the
 algorithm used to convert a DOMString to a sequence of Unicode
 characters, no input can cause the encoding process to emit an encoder
 error.

 This is not correct. A DOMString is not a sequence of Unicode
 characters, it's a UTF16 encoded string (this is per EcmaScript). Thus
 it can contain unpaired surrogates and so the encoding process can
 result in encoder errors.

 As I've suggested earlier, I think we should deal with this by simply
 emitting Unicode replacement characters for these encoder errors (i.e.
 for unpaired surrogates).


Already accounted for. Note the phrase:

 "and because of the algorithm used to convert a DOMString to a sequence of
 Unicode characters"

This refers to the normative text that generates a sequence of Unicode code
points from a DOMString by reference to the algorithm in WebIDL [1], which
handles unpaired surrogates etc.

This informative text should say "Unicode code points" rather than "Unicode
characters", though. Fixing that now, and referencing [1] in the note as well.

[1] http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
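
To make the surrogate handling concrete, here is a rough Python sketch of the kind of conversion described above: UTF-16 code units in, Unicode code points out, with any unpaired surrogate turned into U+FFFD. It is an illustration only, not the WebIDL text; the function names are made up for this example, and the input is modelled as a list of integers because a Python str is not a sequence of UTF-16 code units.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_code_points(units):
    """Convert UTF-16 code units to code points (hypothetical helper).

    Unpaired surrogates become U+FFFD, so the result can always be
    encoded to UTF-8 or UTF-16 without an encoder error.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        c = units[i]
        if 0xD800 <= c <= 0xDBFF:  # lead (high) surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                d = units[i + 1]
                # Valid pair: combine into one supplementary code point.
                out.append(0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)  # lead surrogate with no trail
        elif 0xDC00 <= c <= 0xDFFF:  # trail (low) surrogate on its own
            out.append(REPLACEMENT)
        else:
            out.append(c)
        i += 1
    return out


def encode_utf8(units):
    """Encode UTF-16 code units as UTF-8 after the conversion above."""
    return "".join(map(chr, code_units_to_code_points(units))).encode("utf-8")
```

With this shape, a well-formed pair such as 0xD83D 0xDE00 comes out as the single code point U+1F600, while a lone 0xD800 comes out as U+FFFD, which is why the encoding step itself never has to report an error.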


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-10 Thread Jonas Sicking
On Thu, Aug 9, 2012 at 10:42 AM, Joshua Bell jsb...@chromium.org wrote:
 On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote:



 On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote:

 On 08/07/2012 07:51 PM, Jonas Sicking wrote:

  I don't mind supporting *decoding* from basically any encoding that
 Anne's spec enumerates. I don't see a downside with that since I
 suspect most implementations will just call into a generic decoding
 backend anyway, and so supporting the same set of encodings as for
 other parts of the platform should be relatively easy.


 [...]


  However I think we should consider restricting support to a smaller
 set of encodings when *encoding*. There should be little reason
 for people today to produce text in non-utf formats. We might even be
 able to get away with only supporting UTF8, though I wouldn't be
 surprised if there are reasonably modern file formats which use utf16.


 FWIW, I agree with the decode-from-all-platform-encodings
 encode-to-utf[8|16] position.


 Any disagreement on limiting the supported encodings to utf-8, utf-16, and
 utf-16be, while permitting decoding of all encodings in the Encoding spec?

 (This eliminates the "what to do on encoding error" issue nicely; we still
 need to resolve the BOM issue, though.)


 http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
 supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

 I'm tempted to take it further to just UTF-8 and see if anyone complains.

 Jury is still out on the decode-with-BOM issue - I need to reason through
 Glenn's suggestions on the open issues thread.

 I added a related open issue raised by Glenn, summarized as "... suggest
 that the .encoding attribute simply return the name that was passed to
 the constructor." Taking this further, perhaps the attribute should be
 eliminated as callers could apply it themselves.

I could definitely live with removing the attribute.

/ Jonas


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-10 Thread Jonas Sicking
On Thu, Aug 9, 2012 at 10:42 AM, Joshua Bell jsb...@chromium.org wrote:
 On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote:



 On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote:

 On 08/07/2012 07:51 PM, Jonas Sicking wrote:

  I don't mind supporting *decoding* from basically any encoding that
 Anne's spec enumerates. I don't see a downside with that since I
 suspect most implementations will just call into a generic decoding
 backend anyway, and so supporting the same set of encodings as for
 other parts of the platform should be relatively easy.


 [...]


  However I think we should consider restricting support to a smaller
 set of encodings when *encoding*. There should be little reason
 for people today to produce text in non-utf formats. We might even be
 able to get away with only supporting UTF8, though I wouldn't be
 surprised if there are reasonably modern file formats which use utf16.


 FWIW, I agree with the decode-from-all-platform-encodings
 encode-to-utf[8|16] position.


 Any disagreement on limiting the supported encodings to utf-8, utf-16, and
 utf-16be, while permitting decoding of all encodings in the Encoding spec?

 (This eliminates the "what to do on encoding error" issue nicely; we still
 need to resolve the BOM issue, though.)


 http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
 supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

 I'm tempted to take it further to just UTF-8 and see if anyone complains.

 Jury is still out on the decode-with-BOM issue - I need to reason through
 Glenn's suggestions on the open issues thread.

 I added a related open issue raised by Glenn, summarized as "... suggest
 that the .encoding attribute simply return the name that was passed to
 the constructor." Taking this further, perhaps the attribute should be
 eliminated as callers could apply it themselves.

The spec now contains the following text:

NOTE: Because only UTF encodings are supported, and because of the
algorithm used to convert a DOMString to a sequence of Unicode
characters, no input can cause the encoding process to emit an encoder
error.

This is not correct. A DOMString is not a sequence of Unicode
characters, it's a UTF16 encoded string (this is per EcmaScript). Thus
it can contain unpaired surrogates and so the encoding process can
result in encoder errors.

As I've suggested earlier, I think we should deal with this by simply
emitting Unicode replacement characters for these encoder errors (i.e.
for unpaired surrogates).

/ Jonas


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-09 Thread Joshua Bell
On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote:



 On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote:

 On 08/07/2012 07:51 PM, Jonas Sicking wrote:

  I don't mind supporting *decoding* from basically any encoding that
 Anne's spec enumerates. I don't see a downside with that since I
 suspect most implementations will just call into a generic decoding
 backend anyway, and so supporting the same set of encodings as for
 other parts of the platform should be relatively easy.


 [...]


  However I think we should consider restricting support to a smaller
 set of encodings when *encoding*. There should be little reason
 for people today to produce text in non-utf formats. We might even be
 able to get away with only supporting UTF8, though I wouldn't be
 surprised if there are reasonably modern file formats which use utf16.


 FWIW, I agree with the decode-from-all-platform-encodings
 encode-to-utf[8|16] position.


 Any disagreement on limiting the supported encodings to utf-8, utf-16, and
 utf-16be, while permitting decoding of all encodings in the Encoding spec?

 (This eliminates the "what to do on encoding error" issue nicely; we still
 need to resolve the BOM issue, though.)


http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the
supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE.

I'm tempted to take it further to just UTF-8 and see if anyone complains.

Jury is still out on the decode-with-BOM issue - I need to reason through
Glenn's suggestions on the open issues thread.

I added a related open issue raised by Glenn, summarized as "... suggest
that the .encoding attribute simply return the name that was passed to
the constructor." Taking this further, perhaps the attribute should be
eliminated as callers could apply it themselves.


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-08 Thread James Graham

On 08/07/2012 07:51 PM, Jonas Sicking wrote:


I don't mind supporting *decoding* from basically any encoding that
Anne's spec enumerates. I don't see a downside with that since I
suspect most implementations will just call into a generic decoding
backend anyway, and so supporting the same set of encodings as for
other parts of the platform should be relatively easy.


[...]


However I think we should consider restricting support to a smaller
set of encodings when *encoding*. There should be little reason
for people today to produce text in non-utf formats. We might even be
able to get away with only supporting UTF8, though I wouldn't be
surprised if there are reasonably modern file formats which use utf16.


FWIW, I agree with the decode-from-all-platform-encodings 
encode-to-utf[8|16] position.


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-08 Thread Joshua Bell
On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote:

 On 08/07/2012 07:51 PM, Jonas Sicking wrote:

  I don't mind supporting *decoding* from basically any encoding that
 Anne's spec enumerates. I don't see a downside with that since I
 suspect most implementations will just call into a generic decoding
 backend anyway, and so supporting the same set of encodings as for
 other parts of the platform should be relatively easy.


 [...]


  However I think we should consider restricting support to a smaller
 set of encodings when *encoding*. There should be little reason
 for people today to produce text in non-utf formats. We might even be
 able to get away with only supporting UTF8, though I wouldn't be
 surprised if there are reasonably modern file formats which use utf16.


 FWIW, I agree with the decode-from-all-platform-encodings
 encode-to-utf[8|16] position.


Any disagreement on limiting the supported encodings to utf-8, utf-16, and
utf-16be, while permitting decoding of all encodings in the Encoding spec?

(This eliminates the "what to do on encoding error" issue nicely; we still
need to resolve the BOM issue, though.)


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Joshua Cranmer

On 8/7/2012 12:39 AM, Jonas Sicking wrote:

Hi All,

I seem to have a recollection that we discussed only allowing encoding
to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats
as well as stay in sync with other APIs like XMLHttpRequest.

However I currently can't find any restrictions on which target
encodings are supported in the current drafts.

One wrinkle in this is if we want to support arbitrary encodings when
encoding, that means that we can't insert the replacement
character as default error handling since that isn't available in a
lot of encoding formats.


I found that the wiki version of the proposal cites 
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way 
to find encodings.


--
Beware of bugs in the above code; I have only proved it correct, not tried it. 
-- Donald E. Knuth



Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Glenn Maynard
On Mon, Aug 6, 2012 at 11:39 PM, Jonas Sicking jo...@sicking.cc wrote:

 I seem to have a recollection that we discussed only allowing encoding
 to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats
 as well as stay in sync with other APIs like XMLHttpRequest.


Not an objection, but where does XHR limit sent data to those encodings?
send(FormData) forces UTF-8 (which is even more restrictive);
send(Document) seems to allow any encoding *except* for UTF-16 (presumably
web compat since that's a weird criterion).

I'm not sure that staying in sync with XHR--which has its own pile of
legacy code to support--is worthwhile here anyway, but limiting to Unicode
seems fine in its own right, especially since the restriction can always be
lifted later if real needs come up.

 However I currently can't find any restrictions on which target
 encodings are supported in the current drafts.

 One wrinkle in this is if we want to support arbitrary encodings when
 encoding, that means that we can't insert the replacement
 character as default error handling since that isn't available in a
 lot of encoding formats.


I don't think this part is a real hurdle.  Just replace with ? for
non-Unicode encodings.
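
For comparison, Python's built-in codecs already behave roughly this way, which gives a quick feel for the suggestion. The sketch below is only an analogy, not the proposed API: `encode_with_fallback` is a made-up name, and the "replace" error handler is Python's, which emits "?" when encoding to a codec that cannot represent a character.

```python
def encode_with_fallback(text, encoding):
    """Encode text, substituting for anything the target cannot represent.

    Python's "replace" handler emits "?" on encode. UTF encodings can
    represent every Unicode scalar value, so well-formed text passes
    through them unchanged.
    """
    return text.encode(encoding, errors="replace")


# A snowman (U+2603) has no ISO-8859-1 byte, so it becomes "?".
legacy = encode_with_fallback("caf\u00e9 \u2603", "iso-8859-1")

# The same text survives intact in UTF-8.
modern = encode_with_fallback("caf\u00e9 \u2603", "utf-8")
```

Note that U+FFFD itself is not representable in ISO-8859-1, which is the wrinkle Jonas raised: a legacy target needs some other substitute such as "?".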


On Tue, Aug 7, 2012 at 8:10 AM, Joshua Cranmer pidgeo...@verizon.net wrote:

 I found that the wiki version of the proposal cites 
 http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to
 find encodings.


That spec documents the encodings which are used anywhere in the platform,
but that doesn't necessarily mean every API needs to support all those
encodings.  It's almost all backwards-compatibility.

-- 
Glenn Maynard


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Joshua Bell
On Tue, Aug 7, 2012 at 8:32 AM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Aug 6, 2012 at 11:39 PM, Jonas Sicking jo...@sicking.cc wrote:

  I seem to have a recollection that we discussed only allowing encoding
  to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats
  as well as stay in sync with other APIs like XMLHttpRequest.
 


It looks like the relevant discussion was at
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html

It doesn't appear we reached consensus - there was some desire expressed to
scope to UTF-8, then perhaps expand to include UTF-16, definite consensus
that any encoding supported should be handled by both encode and decode,
then comments about XHR and form data encodings, but then the discussion
wandered into stateful vs. stateless encodings which took us off topic. So
Glenn's comment below pretty much reboots the conversation where it was:


 Not an objection, but where does XHR limit sent data to those encodings?
 send(FormData) forces UTF-8 (which is even more restrictive);
 send(Document) seems to allow any encoding *except* for UTF-16 (presumably
 web compat since that's a weird criterion).

 I'm not sure that staying in sync with XHR--which has its own pile of
 legacy code to support--is worthwhile here anyway, but limiting to Unicode
 seems fine in its own right, especially since the restriction can always be
 lifted later if real needs come up.

 However I currently can't find any restrictions on which target
  encodings are supported in the current drafts.


When Anne's spec appeared I gutted mine and deferred wherever possible to
his. One consequence of that was getting the other encodings for free as
far as the spec writing goes.

If we achieve consensus that we only want to support UTF encodings we can
add the restrictions. There are use cases for supporting other encodings
(parsing legacy data file formats, for example), but that could be deferred.


  One wrinkle in this is if we want to support arbitrary encodings when
  encoding, that means that we can't insert the replacement
  character as default error handling since that isn't available in a
  lot of encoding formats.
 

 I don't think this part is a real hurdle.  Just replace with ? for
 non-Unicode encodings.



On Tue, Aug 7, 2012 at 8:10 AM, Joshua Cranmer pidgeo...@verizon.net wrote:

  I found that the wiki version of the proposal cites 
  http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to
  find encodings.
 

 That spec documents the encodings which are used anywhere in the platform,
 but that doesn't necessarily mean every API needs to support all those
 encodings.  It's almost all backwards-compatibility.


There are also cross-browser differences in handling decoding of certain
code points in certain encodings. Exposing those encodings in a new API
would either require that the browser vendors expose those differences
(bleah) or implement a compatibility switch in the affected codecs (bleah).


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Glenn Maynard
On Tue, Aug 7, 2012 at 11:48 AM, Joshua Bell jsb...@chromium.org wrote:

 It doesn't appear we reached consensus - there was some desire expressed
 to scope to UTF-8, then perhaps expand to include UTF-16, definite
 consensus that any encoding supported should be handled by both encode and
 decode, then comments about XHR and form data encodings, but then the
 discussion wandered into stateful vs. stateless encodings which took us off
 topic. So Glenn's comment below pretty much reboots the conversation where
 it was:


I don't agree that we necessarily need to support both encode and decode
for every encoding.

For example, an MP3 tag editor supporting legacy ID3 tags may want to be
able to decode ISO-8859-1, since it allows tags in that encoding.  However,
there's no reason to ever write MP3 tags in anything but Unicode--they only
need decode support for 8859-1, not encode.

This pattern of decode support for legacy, but only encoding to Unicode,
seems common today.  Many email clients today (not a use case, just a
comparison) also decode from any encoding but send only in UTF-8.
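
The decode-legacy, encode-Unicode pattern is small enough to sketch. This uses Python's codecs purely as a stand-in for the API under discussion; `reencode_legacy_tag` is a hypothetical helper, and the sample bytes are an invented ISO-8859-1 tag value.

```python
def reencode_legacy_tag(raw, source_encoding="iso-8859-1"):
    """Read a tag stored in a legacy encoding and rewrite it as UTF-8.

    Only the decode side needs to know about the legacy encoding;
    the output side is always UTF-8.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# 0xF6 is "o with diaeresis" in ISO-8859-1; in UTF-8 it takes two bytes.
rewritten = reencode_legacy_tag(b"Bj\xf6rk")
```

Nothing in this flow ever needs an ISO-8859-1 encoder, which is the asymmetry being argued for.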

That's not to say there are no use cases for encoding other encodings, but
it's much easier to relax the restriction later and allow them if we really
need to than it is to go the other way, and I think there's a danger of
perpetuating legacy encodings if we're not careful.

 There are also cross-browser differences in handling decoding of certain
 code points in certain encodings. Exposing those encodings in a new API
 would either require that the browser vendors expose those differences
 (bleah) or implement a compatibility switch in the affected codecs (bleah).


The real fix for this would be for browsers to implement the encodings in
the correct, interoperable way when exposed by this API, even if that means
that this API interprets data differently than eg. the HTML parser.  MS has
made it clear that they won't touch their encodings in any way, due to
legacy support, but hopefully that doesn't apply to a new API with no
legacy at all.  (If you want to find that out you'll need to ask on webapps
or through some other channel, since they're not on this list.)

-- 
Glenn Maynard


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Jonas Sicking
On Tue, Aug 7, 2012 at 9:48 AM, Joshua Bell jsb...@chromium.org wrote:
 Not an objection, but where does XHR limit sent data to those encodings?
 send(FormData) forces UTF-8 (which is even more restrictive);
 send(Document) seems to allow any encoding *except* for UTF-16 (presumably
 web compat since that's a weird criterion).

 I'm not sure that staying in sync with XHR--which has its own pile of
 legacy code to support--is worthwhile here anyway, but limiting to Unicode
 seems fine in its own right, especially since the restriction can always be
 lifted later if real needs come up.

 However I currently can't find any restrictions on which target
  encodings are supported in the current drafts.


 When Anne's spec appeared I gutted mine and deferred wherever possible to
 his. One consequence of that was getting the other encodings for free as
 far as the spec writing goes.

 If we achieve consensus that we only want to support UTF encodings we can
 add the restrictions. There are use cases for supporting other encodings
 (parsing legacy data file formats, for example), but that could be deferred.

I don't mind supporting *decoding* from basically any encoding that
Anne's spec enumerates. I don't see a downside with that since I
suspect most implementations will just call into a generic decoding
backend anyway, and so supporting the same set of encodings as for
other parts of the platform should be relatively easy.

That also means that we don't have to figure out which encodings we
need to support to support reading legacy file formats etc.

However I think we should consider restricting support to a smaller
set of encodings when *encoding*. There should be little reason
for people today to produce text in non-utf formats. We might even be
able to get away with only supporting UTF8, though I wouldn't be
surprised if there are reasonably modern file formats which use utf16.

Restricting the encoding formats has the advantage that we can
rely on the target encoding to support a consistent feature set. For
example we don't need to define what to do if we receive a
perfectly well formed string, but the target encoding doesn't support
all the characters in that string. Likewise we don't have to deal with
target encodings which don't support the replacement character.
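
As a rough illustration of that asymmetry, here is a Python sketch in which decoding accepts any codec the runtime knows, while encoding is limited to an allow-list. The allow-list contents (UTF-8 plus the two UTF-16 byte orders) are an assumption taken from the candidates mentioned in this thread, and both function names are invented for the example.

```python
# Assumed allow-list: the UTF targets floated in this thread.
ALLOWED_ENCODE_TARGETS = {"utf-8", "utf-16le", "utf-16be"}


def restricted_encode(text, encoding="utf-8"):
    """Encode only to an allow-listed UTF encoding; reject everything else."""
    name = encoding.strip().lower()
    if name not in ALLOWED_ENCODE_TARGETS:
        raise ValueError(f"unsupported target encoding: {encoding!r}")
    return text.encode(name)


def permissive_decode(data, encoding):
    """Decode from whatever encoding the underlying codec library supports."""
    return data.decode(encoding, errors="replace")
```

Because every allowed target can represent any well-formed string, `restricted_encode` never has to decide what to do with an unrepresentable character, and the allow-list can be widened later without breaking existing callers.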

/ Jonas


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Jonas Sicking
On Tue, Aug 7, 2012 at 10:47 AM, Glenn Maynard gl...@zewt.org wrote:
 On Tue, Aug 7, 2012 at 11:48 AM, Joshua Bell jsb...@chromium.org wrote:

 It doesn't appear we reached consensus - there was some desire expressed
 to scope to UTF-8, then perhaps expand to include UTF-16, definite consensus
 that any encoding supported should be handled by both encode and decode,
 then comments about XHR and form data encodings, but then the discussion
 wandered into stateful vs. stateless encodings which took us off topic. So
 Glenn's comment below pretty much reboots the conversation where it was:


 I don't agree that we necessarily need to support both encode and decode for
 every encoding.

 For example, an MP3 tag editor supporting legacy ID3 tags may want to be
 able to decode ISO-8859-1, since it allows tags in that encoding.  However,
 there's no reason to ever write MP3 tags in anything but Unicode--they only
 need decode support for 8859-1, not encode.

 This pattern of decode support for legacy, but only encoding to Unicode,
 seems common today.  Many email clients today (not a use case, just a
 comparison) also decode from any encoding but send only in UTF-8.

 That's not to say there are no use cases for encoding other encodings, but
 it's much easier to relax the restriction later and allow them if we really
 need to than it is to go the other way, and I think there's a danger of
 perpetuating legacy encodings if we're not careful.

Yup, that matches my feelings exactly.

  There are also cross-browser differences in handling decoding of certain
 code points in certain encodings. Exposing those encodings in a new API
 would either require that the browser vendors expose those differences
 (bleah) or implement a compatibility switch in the affected codecs (bleah).

 The real fix for this would be for browsers to implement the encodings in
 the correct, interoperable way when exposed by this API, even if that means
 that this API interprets data differently than eg. the HTML parser.  MS has
 made it clear that they won't touch their encodings in any way, due to
 legacy support, but hopefully that doesn't apply to a new API with no legacy
 at all.  (If you want to find that out you'll need to ask on webapps or
 through some other channel, since they're not on this list.)

I'm hoping that browsers in general will be able to converge on the
encoding databases that they have. Both as far as which encodings are
supported, and as far as what encoding tables those encodings support.
Anne's spec is a great first step in that direction. It'll definitely
take time before we have full convergence, but I see no reason that we
couldn't get there eventually. We were able to get there with HTML5
parsing after all :-)

/ Jonas


Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Joshua Cranmer

On 8/7/2012 12:48 PM, Joshua Bell wrote:
When Anne's spec appeared I gutted mine and deferred wherever possible 
to his. One consequence of that was getting the other encodings for 
free as far as the spec writing goes. If we achieve consensus that we 
only want to support UTF encodings we can add the restrictions. There 
are use cases for supporting other encodings (parsing legacy data file 
formats, for example), but that could be deferred. 


My main use case, and the only one I'm going to argue for, is being able 
to handle mail messages with this API, and the primary concern here is 
decoding. I'll agree with other sentiments in this thread that I don't 
particularly care about encoding to anything other than UTF-8 (it might 
be nice, but I can live without it); it's being able to decode $CHARSET 
that I'm concerned about. As far as edge cases in this scenario are 
concerned, it pretty much boils down to "I want to produce the same JS
string that would be output if I looked at the text content of the
document data:text/plain;charset=<charset>,<data>".


When encoding, I think it is absolutely necessary to enforce uniform
guidelines for the output. When decoding, however, I think that most
differences (beyond concerns like the BOM) are a result of buggy 
content creators as opposed to the browser media. Given that HTML 
display has apparently tolerated differences in charset decoding for 
legacy charsets, I suppose it is possible to live with a difference of 
exact character decoding for various charsets--in other words, turning 
the charset document into an advisory list of both minimum charsets to 
support and how to do so.


--
Beware of bugs in the above code; I have only proved it correct, not tried it. 
-- Donald E. Knuth



Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-07 Thread Glenn Maynard
On Tue, Aug 7, 2012 at 12:55 PM, Jonas Sicking jo...@sicking.cc wrote:

 I'm hoping that browsers in general will be able to converge on the
 encoding databases that they have. Both as far as which encodings are
 supported, and as far as what encoding tables those encodings support.
 Anne's spec is a great first step in that direction. It'll definitely
 take time before we have full convergence, but I see no reason that we
 couldn't get there eventually. We were able to get there with HTML5
 parsing after all :-)


MS has given a flat refusal to change the encoding tables in any way.
http://permalink.gmane.org/gmane.ietf.charsets/588

Personally I'm inclined to not care which encoding tables we use for legacy
encodings.  Rather than fight this battle and end up without interoperable
tables at all, it might be better to punt this one and standardize on
Microsoft's tables and be done with it.  (Sorry for the slight tangent;
this is an Encoding topic, not a StringEncoding issue.)

-- 
Glenn Maynard


[whatwg] StringEncoding: Allowed encodings for TextEncoder

2012-08-06 Thread Jonas Sicking
Hi All,

I seem to have a recollection that we discussed only allowing encoding
to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats
as well as stay in sync with other APIs like XMLHttpRequest.

However I currently can't find any restrictions on which target
encodings are supported in the current drafts.

One wrinkle in this is if we want to support arbitrary encodings when
encoding, that means that we can't insert the replacement
character as default error handling since that isn't available in a
lot of encoding formats.

/ Jonas