Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 18:13, Alastair Houghton  
> wrote:
> 
> On 16 May 2017, at 17:07, Hans Åberg  wrote:
>> 
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
> UCS-2/UTF-16. ...
 
 The filesystem directory uses octet sequences and does not bother 
 passing along an encoding, I am told. Someone could recall one that 
 used UTF-16 directly, but I think it may not be current.
>>> 
>>> No, that’s not true.  All three of those systems store UTF-16 on the disk 
>>> (give or take).
>> 
>> I am not speaking about what they store, but how the filesystem identifies 
>> files.
> 
> Well, quite clearly none of those systems treat the UTF-16 strings as binary 
> either - they’re case insensitive, so how could they?  HFS+ even normalises 
> strings using a variant of a frozen version of the normalisation spec.

HFS implements case insensitivity in a layer above the raw filesystem 
functions, so it is perfectly possible to have files in the same directory 
that differ only by case by using low-level function calls. Tenon's MachTen 
already did that on Mac OS 9.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 17:07, Hans Åberg  wrote:
> 
 HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
 UCS-2/UTF-16. ...
>>> 
>>> The filesystem directory uses octet sequences and does not bother 
>>> passing along an encoding, I am told. Someone could recall one that 
>>> used UTF-16 directly, but I think it may not be current.
>> 
>> No, that’s not true.  All three of those systems store UTF-16 on the disk 
>> (give or take).
> 
> I am not speaking about what they store, but how the filesystem identifies 
> files.

Well, quite clearly none of those systems treat the UTF-16 strings as binary 
either - they’re case insensitive, so how could they?  HFS+ even normalises 
strings using a variant of a frozen version of the normalisation spec.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 17:52, Alastair Houghton  
> wrote:
> 
> On 16 May 2017, at 16:44, Hans Åberg  wrote:
>> 
>> On 16 May 2017, at 17:30, Alastair Houghton via Unicode 
>>  wrote:
>>> 
>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
>>> UCS-2/UTF-16. ...
>> 
>> The filesystem directory uses octet sequences and does not bother 
>> passing along an encoding, I am told. Someone could recall one that used 
>> UTF-16 directly, but I think it may not be current.
> 
> No, that’s not true.  All three of those systems store UTF-16 on the disk 
> (give or take).

I am not speaking about what they store, but how the filesystem identifies 
files.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 16:44, Hans Åberg  wrote:
> 
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode  
> wrote:
>> 
>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
>> UCS-2/UTF-16. ...
> 
> The filesystem directory uses octet sequences and does not bother passing 
> along an encoding, I am told. Someone could recall one that used UTF-16 
> directly, but I think it may not be current.

No, that’s not true.  All three of those systems store UTF-16 on the disk (give 
or take).  On Windows, the “ANSI” APIs convert the filenames to or from the 
appropriate Windows code page, while the “Wide” API works in UTF-16, which is 
the native encoding for VFAT long filenames and NTFS filenames.  And, as I 
said, on Mac OS X and iOS, the kernel expects filenames to be encoded as UTF-8 
at the BSD API, regardless of what encoding you might be using in your Terminal 
(this is different to traditional UNIX behaviour, where how you interpret your 
filenames is entirely up to you - usually you’d use the same encoding you were 
using on your tty).

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 17:30, Alastair Houghton via Unicode  
> wrote:
> 
> On 16 May 2017, at 14:23, Hans Åberg via Unicode  wrote:
>> 
>> You don't. You have a filename, which is an octet sequence of unknown 
>> encoding, and want to deal with it. Therefore, valid Unicode transformations 
>> of the filename may result in it not being reachable.
>> 
>> It only matters that the correct octet sequence is handed back to the 
>> filesystem. All current filesystems, as far as experts could recall, use 
>> octet sequences at the lowest level; whatever encoding is used is built in a 
>> layer above. 
> 
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
> UCS-2/UTF-16. ...

The filesystem directory uses octet sequences and does not bother passing 
along an encoding, I am told. Someone could recall one that used UTF-16 
directly, but I think it may not be current.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 14:23, Hans Åberg via Unicode  wrote:
> 
> You don't. You have a filename, which is an octet sequence of unknown 
> encoding, and want to deal with it. Therefore, valid Unicode transformations 
> of the filename may result in it not being reachable.
> 
> It only matters that the correct octet sequence is handed back to the 
> filesystem. All current filesystems, as far as experts could recall, use octet 
> sequences at the lowest level; whatever encoding is used is built in a layer 
> above. 

HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
UCS-2/UTF-16.  FAT 8.3 names are also encoded, but the encoding isn’t specified 
(more specifically, MS-DOS and Windows assume an encoding based on your locale, 
which could cause all kinds of fun if you swapped disks with someone from a 
different country, and IIRC there are some shenanigans for Japan because of the 
use of 0xe5 as a deleted file marker).  There are some less widely used 
filesystems that require a particular encoding also (BeOS’ BFS used UTF-8, for 
instance).

Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use 
whose names can’t be converted to UTF-8, the Darwin kernel uses a percent 
encoding scheme(!)

It looks like Apple has changed its mind for APFS and is going with the “bag of 
bytes” approach that’s typical of other systems; at least, that’s what it 
appears to have done on iOS.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 15:23 GMT+02:00 Hans Åberg :

> All current filesystems, as far as experts could recall, use octet
> sequences at the lowest level; whatever encoding is used is built in a
> layer above
>

Not NTFS (on Windows), which uses sequences of 16-bit units. The same goes for
FAT32/exFAT "Long File Names" (the legacy 8.3 short filenames use legacy
8-bit code pages, but those are alternate names used only when a long
filename is not found; they work much like aliased hard links on Unix
filesystems, as if they were separate directory entries, except that they
are hidden by default when their matching LFN is shown).


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 15:00, Philippe Verdy  wrote:
> 
> 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :
> 
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode  
> > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
> 
> It would be useful, for use with filesystems, to have Unicode codepoint 
> markers that indicate how UTF-8, including non-valid sequences, is translated 
> into UTF-32 in a way that the original octet sequence can be restored.
> 
> Why just UTF-32 ?

UTF-32 here is just a synonym for code point numbers. It would suffice to add
markers indicating how the input was translated: for example, code points
meaning "overlong of a given length", "raw byte", or whatever is useful.

> How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid 
> UTF-8/UTF-16/UTF-32 ?

You don't. You have a filename, which is an octet sequence of unknown encoding, 
and want to deal with it. Therefore, valid Unicode transformations of the 
filename may result in it not being reachable.

It only matters that the correct octet sequence is handed back to the 
filesystem. All current filesystems, as far as experts could recall, use octet 
sequences at the lowest level; whatever encoding is used is built in a layer 
above. 
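
For what it's worth, Python 3 already ships one scheme of that kind for
filenames: the PEP 383 "surrogateescape" error handler (used by os.fsdecode
on POSIX systems) maps each undecodable byte 0xXY to the lone surrogate
U+DCXY, so an arbitrary octet sequence survives a round trip through a str.
A minimal sketch, purely illustrative, using nothing beyond the standard
library:

    # PEP 383 "surrogateescape": undecodable bytes become lone surrogates,
    # so the original octet sequence is always recoverable.
    raw = b"\xe0\xe0abc"                                   # not valid UTF-8
    name = raw.decode("utf-8", "surrogateescape")          # '\udce0\udce0abc'
    assert name.encode("utf-8", "surrogateescape") == raw  # lossless round trip

The catch is that the resulting strings are not well-formed Unicode (they
contain unpaired surrogates), which is essentially the escaping problem
Richard mentions elsewhere in the thread.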





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 14:44:44 +0200
Hans Åberg via Unicode  wrote:

> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> >  wrote:  
> ...
> > I think Unicode should not adopt the proposed change.  
> 
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original octet
> sequence can be restored.

Escape sequences for the inappropriate bytes are the natural technique.
Your problem is transitioning smoothly so that the escape character is
always escaped when it means itself. Strictly, it can't be done.

Of course, some sequences of escaped characters should be prohibited.
Checking could be fiddly.
 
Richard.



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 20:08:52 +0900
"Martin J. Dürst via Unicode"  wrote:

> I agree with others that ICU should not be considered to have a
> special status, it should be just one implementation among others.

> [The next point is a side issue, please don't spend too much time on 
> it.] I find it particularly strange that at a time when UTF-8 is
> firmly defined as up to 4 bytes, never including any bytes above
> 0xF4, the Unicode consortium would want to consider recommending that
> <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with
> agreement that Markus seems to have thoughts in the same direction,
> because the proposal (17168-utf-8-recommend.pdf) says "(I suppose
> that lead bytes above F4 could be somewhat debatable.)".

The undesirable sidetrack, I suppose, is worrying about how many planes
will be required for emoji.

However, it does make the point that, while some practices may be
better than others, there isn't necessarily a best practice.

The English of the proposal is unclear - the text would benefit from
showing some maximal subsequences (poor terminology - some of us are
used to non-contiguous subsequences).  When he writes, "For UTF-8,
recommend evaluating maximal subsequences based on the original
structural definition of UTF-8, without ever restricting trail bytes to
less than 80..BF", I am pretty sure he means "For UTF-8,
recommend evaluating maximal subsequences based on the original
structural definition of UTF-8, with the only restriction on trailing
bytes beyond the number of them being that they must be in the range
80..BF".

Thus the "E0 E0 C3 89" example would be converted, with an error flagged,
to the sequence of scalar values FFFD FFFD 00C9.
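
For what it's worth, a stock Python 3 decoder (which follows the current
recommendation) produces the same result for this particular input, since the
two readings only differ when the constrained trail-byte ranges actually come
into play; a quick, purely illustrative check:

    # Python 3 follows the current "maximal subsequence" recommendation;
    # for E0 E0 C3 89 the two readings happen to coincide.
    assert b"\xe0\xe0\xc3\x89".decode("utf-8", "replace") == "\ufffd\ufffd\u00c9"
    # A case where they differ: the current practice (and Python 3) gives
    # three U+FFFDs for E0 80 80, while the proposal, as I read it, would
    # give a single U+FFFD.
    assert b"\xe0\x80\x80".decode("utf-8", "replace") == "\ufffd\ufffd\ufffd"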

This may make a UTF-8 system usable if it tries to use something like
noncharacters as they were understood before CLDR was caught publishing them
as an essential part of text files.

Richard.



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :

>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode 
> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.


Why just UTF-32? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to
valid UTF-8/UTF-16/UTF-32?

In all cases this would require extensions to the three standards (which MUST
be interoperable); you would then choke on new validation rules for these
extensions in all three standards, and on new ill-formed sequences that you
won't be able to convert interoperably. Given the most restrictive conditions
in UTF-16 (which is still the most widely used internal representation), such
extensions would be very complex to manage.

There's no solution: such extensions in any one of them are therefore
undesirable and can only be used privately (without interoperating with the
other two representations), so it's impossible to make sure the original
octet sequences can be restored.

Any deviation from UTF-8/16/32 will be confined to that one UTF. It cannot be
part of the three standard UTFs, but it may be part of a distinct encoding
that is not fully compatible with the three standards.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 15 May 2017, at 12:21, Henri Sivonen via Unicode  
> wrote:
...
> I think Unicode should not adopt the proposed change.

It would be useful, for use with filesystems, to have Unicode codepoint markers 
that indicate how UTF-8, including non-valid sequences, is translated into 
UTF-32 in a way that the original octet sequence can be restored.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode :

> > One additional note: the standard codifies this behaviour as a
> *recommendation*, not a requirement.
>
> This is an odd argument in favor of changing it. If the argument is
> that it's just a recommendation that you don't need to adhere to,
> surely then the people who don't like the current recommendation
> should choose not to adhere to it instead of advocating changing it.


I also agree. The internet is full of RFC specifications that are also "best
practices", and even in that case changing them must be extensively
documented, including a discussion of new compatibility/interoperability
problems and new security risks.

The case of random access into substrings is significant because what was
once treated as valid UTF-8 could become invalid if the best-practice
recommendation is not followed, and that could then cause unexpected
failures: uncaught exceptions making software suddenly fail and become
subject to possible attacks due to this new failure. (This is mostly a
problem for implementations that do not use "safe" U+FFFD replacements but
throw exceptions on ill-formed input: we should not change the cases where
these exceptions may occur by adding new cases caused by an implementation
change based on a change of best practice.)

The consideration of trying to reduce the number of U+FFFDs is not relevant;
it is purely aesthetic, because some people would like to compact the decoded
result in memory. What is really important is not to silently ignore these
ill-formed sequences, and to properly track that there was some data loss.
The number of U+FFFDs inserted (only one, or as many as there are invalid
code units in the input before the first resynchronization point) is not so
important.

Likewise, whether implementations use an accumulator or just a single state
(where each state knows how many code units have been parsed without emitting
an output code point, so that those code units can be decoded by relative
indexed accesses) is not relevant; it is just a very minor optimization
question (in my opinion, using an accumulator that can live in a CPU register
is faster than using relative indexed accesses).

All modern CPUs have enough registers to store that accumulator plus the
input and output pointers, and a finite state number is not needed when the
state can be tracked by the instruction position: you don't necessarily need
to loop for each code unit, but can easily write your decoder so that each
loop iteration processes a full code point or emits a single U+FFFD before
adjusting the input pointer. UTF-8 and UTF-16 are simple enough that
unrolling such loops to process full code points instead of single code
units is easy to implement.

That code will still remain very small (fitting fully in the instruction
cache), and it will be faster because it avoids several conditional branches
and saves one register (for the finite state number) that would otherwise
have to be slowly saved on the stack: two pointer registers (or two access
function/method addresses) plus two data registers plus the program counter
are enough.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Martin J. Dürst via Unicode

Hello everybody,

[using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:

On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag  wrote:



Under what circumstance would it matter how many U+FFFDs you see?


Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
with the current spec should show why it's important to have a
different number of U+FFFDs than the spec's "best practice" calls for
now.


I have just checked (the programming language) Ruby. Some background:

As you might know, Ruby is (at least in theory) pretty 
encoding-independent, meaning you can run scripts in iso-8859-1, in 
Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, 
without any conversion.


However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 
internally, and is optimized to work well that way. Character encoding 
conversion also works with UTF-8 as the pivot encoding.


As far as I understand, Ruby does the same as all of the above software, 
based (among other things) on the fact that we followed the recommendation in 
the standard. Here are a few examples (sorry for the line breaks introduced 
by mail software):


$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD"

$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect'
#=> "A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516=markup#l1507
(for those having a look at these tests, in Ruby's version of 
assert_equal, the expected value comes first (not sure whether this is 
called little-endian or big-endian :-), but this is a decision where the 
various test frameworks are virtually split 50/50 :-(. ))


Even if the above examples and the tests use conversion to UTF-16 (in 
particular the BE variant for better readability), what happens 
internally is that the input is analyzed byte-by-byte. In this case, it 
is easiest to just stop as soon as something is found that is clearly 
invalid (be this a single byte or something longer). This makes a 
data-driven implementation (such as the Ruby transcoder) or one based on 
a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) 
more compact.


In other words, because we never know whether the next byte is a valid one 
such as 0x41, it's easier to just handle one byte at a time if, that way, we 
can avoid lookahead (which is always a good idea when parsing).
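
To make that concrete, here is a minimal sketch (in Python, function name and
structure mine, purely illustrative) of a byte-at-a-time decoder that follows
the current recommendation: one U+FFFD per maximal subsequence, with the
constrained first-trail-byte ranges transcribed from Table 3-7:

    # Illustrative only: UTF-8 decoding with U+FFFD substitution, one U+FFFD
    # per maximal subsequence, per the current recommendation.
    def decode_utf8_replace(data: bytes) -> str:
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b <= 0x7F:                       # ASCII
                out.append(chr(b)); i += 1; continue
            # Lead byte: number of trail bytes and the allowed range of the
            # *first* trail byte (this rules out overlongs, surrogates and
            # values above U+10FFFF as early as possible).
            if   0xC2 <= b <= 0xDF: need, lo, hi, cp = 1, 0x80, 0xBF, b & 0x1F
            elif b == 0xE0:         need, lo, hi, cp = 2, 0xA0, 0xBF, b & 0x0F
            elif b == 0xED:         need, lo, hi, cp = 2, 0x80, 0x9F, b & 0x0F
            elif 0xE1 <= b <= 0xEF: need, lo, hi, cp = 2, 0x80, 0xBF, b & 0x0F
            elif b == 0xF0:         need, lo, hi, cp = 3, 0x90, 0xBF, b & 0x07
            elif b == 0xF4:         need, lo, hi, cp = 3, 0x80, 0x8F, b & 0x07
            elif 0xF1 <= b <= 0xF3: need, lo, hi, cp = 3, 0x80, 0xBF, b & 0x07
            else:                   # C0, C1, F5..FF, or a stray trail byte
                out.append('\uFFFD'); i += 1; continue
            i += 1
            complete = True
            for _ in range(need):
                if i >= n or not (lo <= data[i] <= hi):
                    # The maximal subsequence ends before the current byte:
                    # emit one U+FFFD and re-examine data[i] as a new lead.
                    out.append('\uFFFD'); complete = False; break
                cp = (cp << 6) | (data[i] & 0x3F)
                i += 1
                lo, hi = 0x80, 0xBF  # only the first trail byte is constrained
            if complete:
                out.append(chr(cp))
        return ''.join(out)

Running it over the byte strings from the Ruby one-liners above gives the same
outputs, which is what one would hope.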


I agree with Henri and others that there is no need at all to change the 
recommendation in the standard that has been stable for so long (close 
to 9 years).


Because the original was done on a PR 
(http://www.unicode.org/review/pr-121.html), I think this should at 
least also be handled as PR (if it's not dropped based on the discussion 
here).


I think changing the current definition of "maximal subsequence" is a 
bad idea, because it would mean that one wouldn't know what one was 
speaking about over the years. If necessary, new definitions should be 
introduced for other variants.


I agree with others that ICU should not be considered to have a special 
status, it should be just one implementation among others.


[The next point is a side issue, please don't spend too much time on 
it.] I find it particularly strange that at a time when UTF-8 is firmly 
defined as up to 4 bytes, never including any bytes above 0xF4, the 
Unicode consortium would want to consider recommending that <FD 81 82 
83 84 85> be converted to a single U+FFFD. I note with agreement that 
Markus seems to have thoughts in the same direction, because the 
proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes 
above F4 could be somewhat debatable.)".



Regards,
Martin.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
>
> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable).  However, I’m not entirely certain about things
> like
>
>   e0 e0 c3 89
>
> which the proposal would appear to decode as
>
>   U+FFFD U+FFFD U+FFFD U+FFFD  (3)
>
> instead of a perhaps more reasonable
>
>   U+FFFD U+FFFD U+00C9 (4)
>
> (the key part is the “without ever restricting trail bytes to less than
> 80..BF”)
>

I also agree with that, because of access into strings from random positions:
if you land on byte 0x89, you can tell it is a trailing byte and will want to
look backward; you will then see 0xC3 0x89, which decodes correctly as U+00C9
without any error detected.

So the wrong bytes are only the initial two occurrences of 0xE0, which are
individually converted to U+FFFD.

In summary: when you detect any ill-formed sequence, only replace the first
code unit with U+FFFD and restart scanning from the next code unit, without
skipping over multiple bytes.

This means that emitting multiple occurrences of U+FFFD is not only the best
practice; it also matches the intended design of UTF-8, which allows access
from random positions.
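
A sketch of that resynchronization from a random position (Python, helper
name mine, purely illustrative): back up over at most three trail bytes to
find the byte where decoding should start.

    # Illustrative only: find the start of the (possibly ill-formed) sequence
    # covering byte offset i by backing up over at most three trail bytes.
    def resync(data: bytes, i: int) -> int:
        for _ in range(3):
            if i > 0 and 0x80 <= data[i] <= 0xBF:
                i -= 1
            else:
                break
        return i

    s = bytes.fromhex("e0e0c389")
    j = resync(s, 3)                         # lands on the 0xC3 at offset 2
    assert s[j:j + 2].decode("utf-8") == "\u00c9"

If the byte you land on is still a trail byte, or its sequence does not
actually cover the offset you started from, then the starting byte is part
of an ill-formed sequence.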


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton
 wrote:
> On 16 May 2017, at 09:31, Henri Sivonen via Unicode  
> wrote:
>>
>> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>>  wrote:
>>> That would be true if the in-memory representation had any effect on what 
>>> we’re talking about, but it really doesn’t.
>>
>> If the internal representation is UTF-16 (or UTF-32), it is a likely
>> design that there is a variable into which the scalar value of the
>> current code point is accumulated during UTF-8 decoding.
>
> That’s quite a likely design with a UTF-8 internal representation too; it’s 
> just that you’d only decode during processing, as opposed to immediately at 
> input.

The time to generate the U+FFFDs is at the input time which is what's
at issue here. The later processing, which may then involve iterating
by code point and involving computing the scalar values is a different
step that should be able to assume valid UTF-8 and not be concerned
with invalid UTF-8. (To what extent different programming languages
and frameworks allow confident maintenance of the invariant that after
input all in-RAM UTF-8 can be treated as valid varies.)

>> When the internal representation is UTF-8, only UTF-8 validation is
>> needed, and it's natural to have a fail-fast validator, which *doesn't
>> necessarily need such a scalar value accumulator at all*.
>
> Sure.  But a state machine can still contain appropriate error states without 
> needing an accumulator.

As I said upthread, it could, but it seems inappropriate to ask
implementations to take on that extra complexity on as weak grounds as
"ICU does it" or "feels right" when the current recommendation doesn't
call for those extra states and the current spec is consistent with a
number of prominent non-ICU implementations, including Web browsers.

>>> In what sense is this “interop”?
>>
>> In the sense that prominent independent implementations do the same
>> externally observable thing.
>
> The argument is, I think, that in this case the thing they are doing is the 
> *wrong* thing.

It seems weird to characterize following the currently-specced "best
practice" as "wrong" without showing a compelling fundamental flaw
(such as a genuine security problem) in the currently-specced "best
practice". With implementations of the currently-specced "best
practice" already shipped, I don't think aesthetic preferences should
be considered enough of a reason to proclaim behavior adhering to the
currently-specced "best practice" as "wrong".

>  That many of them do it would only be an argument if there was some reason 
> that it was desirable that they did it.  There doesn’t appear to be such a 
> reason, unless you can think of something that hasn’t been mentioned thus far?

I've already given a reason: UTF-8 validation code not needing to have
extra states catering to aesthetic considerations of U+FFFD
consolidation.

>  The only reason you’ve given, to date, is that they currently do that, so 
> that should be the recommended behaviour (which is little different from the 
> argument - which nobody deployed - that ICU currently does the other thing, 
> so *that* should be the recommended behaviour; the only difference is that 
> *you* care about browsers and don’t care about ICU, whereas you yourself 
> suggested that some of us might be advocating this decision because we care 
> about ICU and not about e.g. browsers).

Not just browsers. Also OpenJDK and Python 3. Do I really need to test
the standard libraries of more languages/systems to more strongly make
the case that the ICU behavior (according to the proposal PDF) is not
the norm and what the spec currently says is?

> I’ll add also that even among the implementations you cite, some of them 
> permit surrogates in their UTF-8 input (i.e. they’re actually processing 
> CESU-8, not UTF-8 anyway).  Python, for example, certainly accepts the 
> sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” 
> implementation that conformed literally to the recommendation, as you seem to 
> want, should instead replace it with *four* U+FFFDs (I think), no?

I see that behavior in Python 2. Earlier, I said that Python 3 agrees
with the current spec for my test case. The Python 2 behavior I see is
not just against "best practice" but obviously incompliant.

(For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.)

> One additional note: the standard codifies this behaviour as a 
> *recommendation*, not a requirement.

This is an odd argument in favor of changing it. If the argument is
that it's just a recommendation that you don't need to adhere to,
surely then the people who don't like the current recommendation
should choose not to adhere to it instead of advocating changing it.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 09:31, Henri Sivonen via Unicode  wrote:
> 
> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>  wrote:
>> That would be true if the in-memory representation had any effect on what 
>> we’re talking about, but it really doesn’t.
> 
> If the internal representation is UTF-16 (or UTF-32), it is a likely
> design that there is a variable into which the scalar value of the
> current code point is accumulated during UTF-8 decoding.

That’s quite a likely design with a UTF-8 internal representation too; it’s 
just that you’d only decode during processing, as opposed to immediately at 
input.

> When the internal representation is UTF-8, only UTF-8 validation is
> needed, and it's natural to have a fail-fast validator, which *doesn't
> necessarily need such a scalar value accumulator at all*.

Sure.  But a state machine can still contain appropriate error states without 
needing an accumulator.  That the ones you care about currently don’t is 
readily apparent, but there’s nothing stopping them from doing so.

I don’t see this as an argument about implementations, since it really makes 
very little difference to the implementation which approach is taken; in both 
internal representations, the question is whether you generate U+FFFD 
immediately on detection of the first incorrect *byte*, or whether you do so 
after reading a complete sequence.  UTF-8 sequences are bounded anyway, so it 
isn’t as if failing early gives you any significant performance benefit.

>> In what sense is this “interop”?
> 
> In the sense that prominent independent implementations do the same
> externally observable thing.

The argument is, I think, that in this case the thing they are doing is the 
*wrong* thing.  That many of them do it would only be an argument if there was 
some reason that it was desirable that they did it.  There doesn’t appear to be 
such a reason, unless you can think of something that hasn’t been mentioned 
thus far?  The only reason you’ve given, to date, is that they currently do 
that, so that should be the recommended behaviour (which is little different 
from the argument - which nobody deployed - that ICU currently does the other 
thing, so *that* should be the recommended behaviour; the only difference is 
that *you* care about browsers and don’t care about ICU, whereas you yourself 
suggested that some of us might be advocating this decision because we care 
about ICU and not about e.g. browsers).

I’ll add also that even among the implementations you cite, some of them permit 
surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not 
UTF-8 anyway).  Python, for example, certainly accepts the sequence [ed a0 bd 
ed b8 80] and decodes it as U+1F600; a true “fast fail” implementation that 
conformed literally to the recommendation, as you seem to want, should instead 
replace it with *four* U+FFFDs (I think), no?

One additional note: the standard codifies this behaviour as a 
*recommendation*, not a requirement.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode

> On 16 May 2017, at 10:29, David Starner  wrote:
> 
> On Tue, May 16, 2017 at 1:45 AM Alastair Houghton 
>  wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen to 
> decode to U+FFFD.  There might seem to be *two* names that both contain 
> U+FFFD in the same place.  How do you distinguish between them?
> 
>> If the database holds raw bytes, then the name is a byte string, not a 
>> Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule 
>> to make and enforce that a string in a database is a validly formatted 
>> string; I would hope that most SQL servers do in fact reject malformed UTF-8 
>> strings. On the other hand, I'd expect that an SQL server would accept 
>> U+FFFD in a Unicode string.

Databases typically separate the encoding in which strings are stored from the 
encoding in which an application connected to the database is operating.  A 
database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other 
character set, while presenting it to a client application as UTF-8 or UTF-16.  
Hence my comment - application software could very well see two names that are 
apparently identical and that include U+FFFDs in the same places, even though 
the database back-end actually has different strings.  As I said, this is a 
problem we already have.

> I don’t see a problem; the point is that where a structurally valid UTF-8 
> encoding has been used, albeit in an invalid manner (e.g. encoding a number 
> that is not a valid code point, or encoding a valid code point as an 
> over-long sequence), a single U+FFFD is appropriate.  That seems a perfectly 
> sensible rule to adopt.
>  
>> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that 
>> the only source of such UTF-8 data is willful attempts to break security, 
>> and in that case, how is this a win? Nonattack sources of broken data are 
>> much more likely to be the result of mixing UTF-8 with other character 
>> encodings or raw binary data.

I’d say there are three sources of UTF-8 data of that ilk:

(a) bugs,
(b) “Modified UTF-8” and “CESU-8” implementations,
(c) wilful attacks

(b) in particular is quite common, and the result of the presently recommended 
approach doesn’t make much sense there ([c0 80] will get replaced with *two* 
U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *four* U+FFFDs - 
surrogates aren’t supposed to be valid in UTF-8, right?)

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD.  There might seem to be *two* names that both contain
> U+FFFD in the same place.  How do you distinguish between them?
>

If the database holds raw bytes, then the name is a byte string, not a
Unicode string, and can't contain U+FFFD at all. It's a relatively easy
rule to make and enforce that a string in a database is a validly formatted
string; I would hope that most SQL servers do in fact reject malformed
UTF-8 strings. On the other hand, I'd expect that an SQL server would
accept U+FFFD in a Unicode string.


> I don’t see a problem; the point is that where a structurally valid UTF-8
> encoding has been used, albeit in an invalid manner (e.g. encoding a number
> that is not a valid code point, or encoding a valid code point as an
> over-long sequence), a single U+FFFD is appropriate.  That seems a
> perfectly sensible rule to adopt.
>

It seems like a perfectly arbitrary rule to adopt; I'd like to assume that
the only source of such UTF-8 data is willful attempts to break security,
and in that case, how is this a win? Nonattack sources of broken data are
much more likely to be the result of mixing UTF-8 with other character
encodings or raw binary data.

>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode

> On 16 May 2017, at 09:18, David Starner  wrote:
> 
> On Tue, May 16, 2017 at 12:42 AM Alastair Houghton 
>  wrote:
>> If you’re about to mutter something about security, consider this: security 
>> code *should* refuse to compare strings that contain U+FFFD (or at least 
>> should never treat them as equal, even to themselves), because it has no way 
>> to know what that code point represents.
>> 
> Which causes various other security problems; if an object (file, database 
> element, etc.) gets a name with a FFFD in it, it becomes impossible to 
> reference. That an IEEE 754 float may not equal itself is a perpetual source 
> of confusion for programmers.

That’s true anyway; imagine the database holds raw bytes, that just happen to 
decode to U+FFFD.  There might seem to be *two* names that both contain U+FFFD 
in the same place.  How do you distinguish between them?

Clearly if you are holding Unicode code points that you know are validly 
encoded somehow, you may want to be able to match U+FFFDs, but that’s a special 
case where you have extra knowledge.

> In this case, It's pretty clear, but I don't see it as a general rule.  Any 
> rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake 
> or random binary data.

I don’t see a problem; the point is that where a structurally valid UTF-8 
encoding has been used, albeit in an invalid manner (e.g. encoding a number 
that is not a valid code point, or encoding a valid code point as an over-long 
sequence), a single U+FFFD is appropriate.  That seems a perfectly sensible 
rule to adopt.

The proposal actually does cover things that aren’t structurally valid, like 
your e0 e0 e0 example, which it suggests should be a single U+FFFD because the 
initial e0 denotes a three byte sequence, and your 80 80 80 example, which it 
proposes should constitute three illegal subsequences (again, both reasonable). 
 However, I’m not entirely certain about things like

  e0 e0 c3 89

which the proposal would appear to decode as

  U+FFFD U+FFFD U+FFFD U+FFFD  (3)

instead of a perhaps more reasonable

  U+FFFD U+FFFD U+00C9 (4)

(the key part is the “without ever restricting trail bytes to less than 80..BF”)

and if Markus or others could explain why they chose (3) over (4) I’d be quite 
interested to hear the explanation.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag  wrote:
> but I think the way he raises this point is needlessly antagonistic.

I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
 wrote:
> That would be true if the in-memory representation had any effect on what 
> we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely
design that there is a variable into which the scalar value of the
current code point is accumulated during UTF-8 decoding. In such a
scenario, it can be argued as "natural" to first operate according to
the general structure of UTF-8 and then inspect what you got in the
accumulation variable (ruling out non-shortest forms, values above the
Unicode range and surrogate values after the fact).

When the internal representation is UTF-8, only UTF-8 validation is
needed, and it's natural to have a fail-fast validator, which *doesn't
necessarily need such a scalar value accumulator at all*. The
construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ when
used as a UTF-8 validator is the best illustration of a UTF-8
validator not necessarily looking like a "natural" UTF-8 to UTF-16
converter at all.
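
As a concrete illustration of validation without an accumulator, a
range-check-only validator needs nothing but the lead byte and the allowed
ranges of its trail bytes. A sketch in Python (function name mine, ranges
transcribed from Table 3-7, illustrative rather than authoritative):

    # Illustrative only: validate UTF-8 with range checks alone; no scalar
    # value is ever computed.
    def is_valid_utf8(data: bytes) -> bool:
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b <= 0x7F:
                i += 1
            elif 0xC2 <= b <= 0xDF:
                if i + 1 >= n or not (0x80 <= data[i + 1] <= 0xBF):
                    return False
                i += 2
            elif 0xE0 <= b <= 0xEF:
                lo = 0xA0 if b == 0xE0 else 0x80   # excludes overlongs
                hi = 0x9F if b == 0xED else 0xBF   # excludes surrogates
                if i + 2 >= n or not (lo <= data[i + 1] <= hi) \
                        or not (0x80 <= data[i + 2] <= 0xBF):
                    return False
                i += 3
            elif 0xF0 <= b <= 0xF4:
                lo = 0x90 if b == 0xF0 else 0x80   # excludes overlongs
                hi = 0x8F if b == 0xF4 else 0xBF   # excludes > U+10FFFF
                if i + 3 >= n or not (lo <= data[i + 1] <= hi) \
                        or not (0x80 <= data[i + 2] <= 0xBF) \
                        or not (0x80 <= data[i + 3] <= 0xBF):
                    return False
                i += 4
            else:                                  # C0, C1, F5..FF, stray trail
                return False
        return True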

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right”.
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same
externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
with the current spec should show why it's important to have a
different number of U+FFFDs than the spec's "best practice" calls for
now.

>  If you’re about to mutter something about security, consider this: security 
> code *should* refuse to compare strings that contain U+FFFD (or at least 
> should never treat them as equal, even to themselves), because it has no way 
> to know what that code point represents.

In practice, e.g. the Web Platform doesn't allow for stopping
operating on input that contains an U+FFFD, so the focus is mainly on
making sure that U+FFFDs are placed well enough to prevent bad stuff
under normal operations. At least typically, the number of U+FFFDs
doesn't matter for that purpose, but when browsers agree on the number
 of U+FFFDs, changing that number should have an overwhelmingly strong
rationale. A security reason could be a strong reason, but such a
security motivation for fewer U+FFFDs has not been shown, to my
knowledge.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD   (2)

I advocate (1), most simply because that's what Firefox, Edge and
Chrome do *in accordance with the currently-recommended best practice*
and, less simply, because it makes sense in the presence of a
fail-fast UTF-8 validator. I think the burden of proof to show an
overwhelmingly good reason to change should, at this point, be on
whoever proposes doing it differently than what the current
widely-implemented spec says.

> It’s pretty clear what the intent of the encoder was there, I’d say, and 
> while we certainly don’t want to decode it as a NUL (that was the source of 
> previous security bugs, as I recall), I also don’t see the logic in insisting 
> that it must be decoded to *three* code points when it clearly only 
> represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever
a fail-fast validator fails.

> This isn’t just a matter of “feels nicer”.  (1) is simply illogical 
> behaviour, and since behaviours (1) and (2) are both clearly out there today, 
> it makes sense to pick the more logical alternative as the official 
> recommendation.

Again, the current best practice makes perfect logical sense in the
context of a fail-fast UTF-8 validator. Moreover, it doesn't look like
both are "out there" equally when major browsers, OpenJDK and Python 3
agree. (I expect I could find more prominent implementations that
implement the currently-stated best practice, but I feel I shouldn't
have to.) From my experience working on Web standards and
implementing them, I think it's 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves), because it has
> no way to know what that code point represents.
>

Which causes various other security problems; if an object (file, database
element, etc.) gets a name with a FFFD in it, it becomes impossible to
reference. That an IEEE 754 float may not equal itself is a perpetual
source of confusion for programmers.


> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD   (2)
>
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in
> insisting that it must be decoded to *three* code points when it clearly
> only represented one in the input.
>

In this case, it's pretty clear, but I don't see it as a general rule.  Any
rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or
mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not
going to insist that it get replaced with U+FFFD U+FFFD because it's clear
(to me) it was meant as two characters.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 10:01:03 +0300
Henri Sivonen via Unicode  wrote:

> Even so, I think even changing a recommendation of "best practice"
> needs way better rationale than "feels right" or "ICU already does it"
> when a) major browsers (which operate in the most prominent
> environment of broken and hostile UTF-8) agree with the
> currently-recommended best practice and b) the currently-recommended
> best practice makes more sense for implementations where "UTF-8
> decoding" is actually mere "UTF-8 validation".

There was originally an attempt to prescribe rather than to recommend
the interpretation of ill-formed 8-bit Unicode strings.  It may even
briefly have been an issued prescription, until common sense prevailed.
I do remember a sinking feeling when I thought I would have to change
my own handling of bogus UTF-8, only to be relieved later when it
became mere best practice.  However, it is not uncommon for coding
standards to prescribe 'best practice'.

Richard.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread J Decker via Unicode
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
>  wrote:
> > I’m not sure how the discussion of “which is better” relates to the
> > discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, because UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraits permit)
> that *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory
> representation (for the purposes of this thread) but trying to
> motivate why the Consortium should consider "UTF-8 internally" equally
> despite it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally"
> perspective, but one way clearly makes more sense from the "UTF-8
> internally" perspective, the "UTF-8 internally" perspective should be
> decisive in *such a case*. (I think the matter at hand is such a
> case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome)


Something I've learned through working with Node (the V8 JavaScript engine
from Chrome): V8 stores strings either as UTF-16 OR UTF-8 interchangeably and
is not one OR the other...

https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

and I wouldn't really assume UTF-16 is a 'majority'; Go is UTF-8, for
instance.



> shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 08:22, Asmus Freytag via Unicode  wrote:

> I therefore think that Henri has a point when he's concerned about tacit 
> assumptions favoring one memory representation over another, but I think the 
> way he raises this point is needlessly antagonistic.

That would be true if the in-memory representation had any effect on what we’re 
talking about, but it really doesn’t.

(The only time I can think of that the in-memory representation has a 
significant effect is where you’re talking about default binary ordering of 
string data, in which case, in the presence of non-BMP characters, UTF-8 and 
UCS-4 sort the same way, but because the surrogates are “in the wrong place”, 
UTF-16 doesn’t.  I think everyone is well aware of that, no?)
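
A two-character example of that ordering difference, in case it helps
(Python, purely illustrative): U+FF01 sorts before U+10000 in UTF-8 and
UTF-32 byte order, but after it in UTF-16 byte order, because supplementary
characters are encoded with surrogates in the D800..DFFF range.

    a, b = "\uFF01", "\U00010000"
    assert a.encode("utf-8") < b.encode("utf-8")             # EF BC 81 < F0 90 80 80
    assert a.encode("utf-32-be") < b.encode("utf-32-be")     # 0000FF01 < 00010000
    assert b.encode("utf-16-be") < a.encode("utf-16-be")     # D800 DC00 < FF01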

>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>> test with three major browsers that use UTF-16 internally and have
>> independent (of each other) implementations of UTF-8 decoding
>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>> Unicode standard away from that kind of interop needs *way* better
>> rationale than "feels right”.

In what sense is this “interop”?  Under what circumstance would it matter how 
many U+FFFDs you see?  If you’re about to mutter something about security, 
consider this: security code *should* refuse to compare strings that contain 
U+FFFD (or at least should never treat them as equal, even to themselves), 
because it has no way to know what that code point represents.

Would you advocate replacing

  e0 80 80

with

  U+FFFD U+FFFD U+FFFD (1)

rather than

  U+FFFD   (2)

It’s pretty clear what the intent of the encoder was there, I’d say, and while 
we certainly don’t want to decode it as a NUL (that was the source of previous 
security bugs, as I recall), I also don’t see the logic in insisting that it 
must be decoded to *three* code points when it clearly only represented one in 
the input.

This isn’t just a matter of “feels nicer”.  (1) is simply illogical behaviour, 
and since behaviours (1) and (2) are both clearly out there today, it makes 
sense to pick the more logical alternative as the official recommendation.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 15 May 2017, at 23:43, Richard Wordingham via Unicode  
wrote:
> 
> The problem with surrogates is inadequate testing.  They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered.  It's not always obvious that code is designed for UCS-2
> rather than UTF-16.

While I don’t think we should spend too long debating the relative merits of 
UTF-8 versus UTF-16, I’ll note that that argument applies equally to both 
combining characters and indeed the underlying UTF-8 encoding in the first 
place, and that mistakes in handling both are not exactly uncommon.  There are 
advantages to UTF-8 and advantages to UTF-16.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen  wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".

Testing with that file, Python 3 and OpenJDK 8 agree with the
currently-specced best-practice, too. I expect there to be other
well-known implementations that comply with the currently-specced best
practice, so the rationale to change the stated best practice would
have to be very strong (as in: security problem with currently-stated
best practice) for a change to be appropriate.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode

On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:

On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
 wrote:

I’m not sure how the discussion of “which is better” relates to the
discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, because UTF-8 as the internal memory
representation is *such a good design* (when legacy constraits permit)
that *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.
There are cases where it is prohibitive to transcode external data from 
UTF-8 to any other format, as a precondition to doing any work. In these 
situations processing has to be done in UTF-8, effectively making that 
the in-memory representation. I've encountered this issue on separate 
occasions, both for my own code as well as code I reviewed for clients.


I therefore think that Henri has a point when he's concerned about tacit 
assumptions favoring one memory representation over another, but I think 
the way he raises this point is needlessly antagonistic.

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.


This is a key point. It may not be directly relevant to any other 
modifications to the standard, but the larger point is to not make 
assumptions about how people implement the standard (or any of the 
algorithms).

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

I would like to second this as well.

The level of documented review of existing implementation practices 
tends to be thin (at least thinner than should be required for changing 
long-established edge cases or recommendations, let alone core  
conformance requirements).


Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".
It would be good if the UTC could work out some minimal requirements for 
evaluating proposals for changes to properties and algorithms, much like 
the criteria for encoding new code points.

A./


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 15 May 2017, at 23:16, Shawn Steele via Unicode  wrote:
> 
> I’m not sure how the discussion of “which is better” relates to the 
> discussion of ill-formed UTF-8 at all.

It doesn’t, which is a point I made in my original reply to Henri.  The only 
reason I answered his anti-UTF-16 rant at all was to point out that some of us 
don’t think UTF-16 is a mistake, and in fact can see various benefits 
(*particularly* as an in-memory representation).

> And to the last, saying “you cannot process UTF-16 without handling 
> surrogates” seems to me to be the equivalent of saying “you cannot process 
> UTF-8 without handling lead & trail bytes”.  That’s how the respective 
> encodings work.

Quite.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson
 wrote:
> On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>>
>> The proposal is to make ICU's spec violation conforming. I think there
>> is both a technical and a political reason why the proposal is a bad
>> idea.
>
>
>
> Henri's claim that "The proposal is to make ICU's spec violation conforming"
> is a false statement, and hence all further commentary based on this false
> premise is irrelevant.
>
> I believe that ICU is actually currently conforming to TUS.

Do you mean that ICU's behavior differs from what the PDF claims (I
didn't test and took the assertion in the PDF about behavior at face
value) or do you mean that despite deviating from the
currently-recommended best practice the behavior is conforming,
because the relevant part of the spec is mere best practice and not a
requirement?

> TUS has certain requirements for UTF-8 handling, and it has certain other
> "Best Practices" as detailed in 3.9.  The proposal involves changing those
> recommendations.  It does not involve changing any requirements.

Even so, I think even changing a recommendation of "best practice"
needs way better rationale than "feels right" or "ICU already does it"
when a) major browsers (which operate in the most prominent
environment of broken and hostile UTF-8) agree with the
currently-recommended best practice and b) the currently-recommended
best practice makes more sense for implementations where "UTF-8
decoding" is actually mere "UTF-8 validation".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
 wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, because UTF-8 as the internal memory
representation is *such a good design* (when legacy constraints permit)
that *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory
representation (for the purposes of this thread) but trying to
motivate why the Consortium should consider "UTF-8 internally" equally
despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally"
perspective, but one way clearly makes more sense from the "UTF-8
internally" perspective, the "UTF-8 internally" perspective should be
decisive in *such a case*. (I think the matter at hand is such a
case.)

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".
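
For concreteness, here is a rough Python sketch of the currently-recommended
behavior as I read it: one U+FFFD per maximal subsequence, where a maximal
subsequence is either the longest prefix of a well-formed sequence or a
single byte. The byte strings at the end are illustrative only and are not
the exact bytes used on the test page.

    REPLACEMENT = "\uFFFD"

    def second_byte_range(lead: int):
        # Permitted range for the byte immediately after `lead`, following
        # the well-formed UTF-8 byte ranges (Table 3-7 of the core spec).
        if 0xC2 <= lead <= 0xDF: return (0x80, 0xBF)
        if lead == 0xE0:         return (0xA0, 0xBF)
        if 0xE1 <= lead <= 0xEC: return (0x80, 0xBF)
        if lead == 0xED:         return (0x80, 0x9F)
        if 0xEE <= lead <= 0xEF: return (0x80, 0xBF)
        if lead == 0xF0:         return (0x90, 0xBF)
        if 0xF1 <= lead <= 0xF3: return (0x80, 0xBF)
        if lead == 0xF4:         return (0x80, 0x8F)
        return None  # C0, C1, F5..FF or a stray trail byte: never starts a sequence

    def decode_with_replacement(data: bytes) -> str:
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b <= 0x7F:                        # ASCII fast path
                out.append(chr(b)); i += 1; continue
            rng = second_byte_range(b)
            if rng is None:
                out.append(REPLACEMENT); i += 1; continue
            length = 2 if b <= 0xDF else 3 if b <= 0xEF else 4
            cp = b & (0x1F if length == 2 else 0x0F if length == 3 else 0x07)
            lo, hi = rng
            j = i + 1
            while j < i + length and j < n and lo <= data[j] <= hi:
                cp = (cp << 6) | (data[j] & 0x3F)
                lo, hi = 0x80, 0xBF              # only the second byte is restricted
                j += 1
            if j == i + length:
                out.append(chr(cp))              # complete, well-formed sequence
            else:
                out.append(REPLACEMENT)          # one U+FFFD per longest valid prefix
            i = j
        return "".join(out)

    # Illustrative ill-formed inputs (not the bytes on the test page):
    print(decode_with_replacement(b"\xc0\xaf"))          # 2 x U+FFFD: C0 never starts a sequence
    print(decode_with_replacement(b"\xf0\x80\x80A"))     # 3 x U+FFFD, then 'A'
    print(decode_with_replacement(b"\xe1\x80\xe2\x82"))  # 2 x U+FFFD: two truncated prefixes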

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Karl Williamson via Unicode

On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:

In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.



Henri's claim that "The proposal is to make ICU's spec violation 
conforming" is a false statement, and hence all further commentary based 
on this false premise is irrelevant.


I believe that ICU is actually currently conforming to TUS.

The proposal reads:

"For UTF-8, recommend evaluating maximal subsequences based on the 
original structural definition of UTF-8..."


There is nothing in here that is requiring any implementation to be 
changed.  The word "recommend" does not mean the same as "require". 
Have you guys been so caught up in the current international political 
situation that you have lost the ability to read straight?


TUS has certain requirements for UTF-8 handling, and it has certain 
other "Best Practices" as detailed in 3.9.  The proposal involves 
changing those recommendations.  It does not involve changing any 
requirements.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Philippe Verdy via Unicode
Software designed with only UCS-2 and without real UTF-16 support is still
used today.

One example is MySQL with its broken "UTF-8" encoding, which in fact
encodes supplementary characters as two separate 16-bit surrogate code
units, each blindly encoded as a 3-byte sequence that would be ill-formed
in standard UTF-8; it also does not distinguish invalid pairs of
surrogates, and offers no collation support for supplementary characters.
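
To illustrate the scheme described above (a sketch of that encoding
behavior, not of MySQL's actual code), compare a supplementary character
encoded as standard UTF-8 and as a pair of 3-byte surrogate sequences:

    def surrogate_pair_utf8(text: str) -> bytes:
        """Sketch of the broken scheme described above: BMP code points as
        normal UTF-8, supplementary code points as a UTF-16 surrogate pair
        with each surrogate blindly encoded as a 3-byte sequence (which is
        ill-formed in standard UTF-8)."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp <= 0xFFFF:
                out += ch.encode("utf-8")
            else:
                cp -= 0x10000
                high = 0xD800 + (cp >> 10)
                low = 0xDC00 + (cp & 0x3FF)
                # 'surrogatepass' makes Python emit the 3-byte forms that a
                # strict UTF-8 decoder must reject.
                out += chr(high).encode("utf-8", "surrogatepass")
                out += chr(low).encode("utf-8", "surrogatepass")
        return bytes(out)

    emoji = "\U0001F600"
    print(emoji.encode("utf-8").hex(" "))       # f0 9f 98 80 (well-formed, 4 bytes)
    print(surrogate_pair_utf8(emoji).hex(" "))  # ed a0 bd ed b8 80 (ill-formed, 6 bytes)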

In this case some other software will break silently on these sequences.
For example, MediaWiki installed with a MySQL backend whose datastore was
created with that broken "UTF-8" will silently discard any text starting
at the first supplementary character found in the wikitext. This is not a
problem in MediaWiki itself: MediaWiki simply does not support a MySQL
server installed with the "UTF-8" datastore, and only supports MySQL when
the storage encoding declared for the database is "binary" (but in that
case there is no collation support in MySQL; texts are just arbitrary
sequences of bytes, and internationalization is then handled in the client
software, here MediaWiki with its PHP, ICU or Lua libraries, and other
tools written in Perl and other languages).

Note that this does not affect Wikimedia's wikis, because they were
initially installed correctly with the binary encoding in MySQL, and
Wikimedia wikis now use another database engine with native UTF-8 support
and full coverage of the UCS. Other wikis using MediaWiki that want to
keep MySQL for administrative reasons will need to upgrade their MySQL
version (rather than convert their datastore to the binary encoding).

Software running with only UCS-2 is exposed to risks similar to the one
seen in MediaWiki on such incorrect MySQL installations: any user may edit
a page and insert a supplementary character (a supplementary sinogram, an
emoji, a Gothic letter, a supplementary symbol...) which will look correct
when previewed and parse correctly, and will be accepted silently by
MySQL, but will then be silently truncated because of the encoding error:
when the data is reloaded from MySQL, part of it has unexpectedly been
discarded.

How should we react to the risk of data loss or truncation? Throwing an
exception or just returning an error is in fact more dangerous than
replacing the ill-formed sequences with one or more U+FFFD: that way we
preserve as much as possible. In any case, software should be able to run
some tests against its datastore to check that it handles the encoding
correctly. This could be done when the software starts, emitting log
messages when the backend does not support the encoding: all that is
needed is to send a single supplementary character to the remote datastore
in a junk table or field and then retrieve it immediately in another
transaction, to make sure it is preserved. Similar tests can be done to
see whether the remote datastore also preserves the encoding form, or
normalizes or otherwise alters it (such alteration could happen with a
leading BOM, and other silent alterations could affect NULs and trailing
spaces if the datastore uses fixed-length rather than variable-length text
fields). Similar tests could check the maximum length accepted: a
VARCHAR(256) in a binary-encoded database will not always store 256
Unicode characters, but in a database encoded with non-broken UTF-8 it
should store 256 code points independently of their values, even if their
UTF-8 encoding takes up to 1024 bytes.
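
A minimal sketch of such a startup self-test (write_field and read_field
are hypothetical stand-ins for whatever datastore API is actually in use;
here a plain dict plays the part of the backend):

    import logging

    PROBE = "\U00010348"  # GOTHIC LETTER HWAIR; any supplementary-plane character will do

    def backend_preserves_supplementary(write_field, read_field) -> bool:
        """Write a supplementary character to a scratch field and read it
        back; if it comes back altered or truncated, the backend (or its
        declared encoding) cannot be trusted with the full range of Unicode."""
        write_field("encoding_probe", PROBE)
        ok = read_field("encoding_probe") == PROBE
        if not ok:
            logging.warning("backend mangles supplementary characters; "
                            "check its declared text encoding")
        return ok

    # Demonstration with an in-memory stand-in for the remote datastore:
    store = {}
    print(backend_preserves_supplementary(store.__setitem__, store.__getitem__))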


2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

> On Mon, 15 May 2017 21:38:26 +
> David Starner via Unicode  wrote:
>
> > > and the fact is that handling surrogates (which is what proponents
> > > of UTF-8 or UCS-4 usually focus on) is no more complicated than
> > > handling combining characters, which you have to do anyway.
>
> > Not necessarily; you can legally process Unicode text without worrying
> > about combining characters, whereas you cannot process UTF-16 without
> > handling surrogates.
>
> The problem with surrogates is inadequate testing.  They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered.  It's not always obvious that code is designed for UCS-2
> rather than UTF-16.
>
> Richard.
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Philippe Verdy via Unicode
2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode :

> I think this political reason should be taken very seriously. There are
> already too many instances where ICU can be seen "driving" the development
> of properties and algorithms.
>
> Those involved in the ICU project may not see the problem, but I agree
> with Henri that it requires a bit more sensitivity from the UTC.
>
I don't think that the fact that ICU originally used UTF-16 internally has
ANY effect on the decision to represent ill-formed sequences as a single
U+FFFD or as multiple U+FFFDs.
The internal encoding has nothing in common with the external encoding used
when processing input data (which may be UTF-8, UTF-16 or UTF-32, and could
in all cases present ill-formed sequences). The internal encoding plays no
role in how the ill-formed input is converted, or in whether it is
converted at all.
So yes, independently of the internal encoding, we still have to choose
between:
- not converting the input, and returning an error or throwing an
  exception;
- converting the input using a single U+FFFD (whose internal representation
  does not matter here) to replace the complete sequence of ill-formed code
  units in the input data, and preferably returning an error status;
- converting the input using as many U+FFFDs (whose internal representation
  does not matter here) as needed to replace every occurrence of ill-formed
  code units in the input data, and preferably returning an error status.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread David Starner via Unicode
On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <
unicode@unicode.org> wrote:

> Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the
> case for other situations


UTF-8 is clearly more efficient space-wise for text that includes more ASCII
characters than characters between U+0800 and U+FFFF. Given the prevalence
of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic
text will pretty much always be smaller in UTF-8.

Even for scripts that go from 2 bytes to 3, webpages can get much smaller
in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a
factor of 1.8). The maximum change in the other direction is a factor of
1.5, as two bytes become three.
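
A quick way to check such ratios on one's own data (the sample strings
below are made up for illustration; real pages mix markup and text, which
is why figures like the one above vary):

    samples = {
        "ASCII-heavy markup": '<p class="note">Hello, world!</p>',
        "Russian":            "Привет, мир! Это пример текста.",
        "Chinese":            "这是一个用来比较编码大小的例子。",
    }
    for label, text in samples.items():
        u8, u16 = len(text.encode("utf-8")), len(text.encode("utf-16-le"))
        print(f"{label:20} UTF-8: {u8:3} B  UTF-16: {u16:3} B  ratio: {u8 / u16:.2f}")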


> and the fact is that handling surrogates (which is what proponents of
> UTF-8 or UCS-4 usually focus on) is no more complicated than handling
> combining characters, which you have to do anyway.
>

Not necessarily; you can legally process Unicode text without worrying
about combining characters, whereas you cannot process UTF-16 without
handling surrogates.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode

On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:



ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes concerns of such implementation very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case in any way to
go away, I think the Unicode Consortium should recognize the serious
technical merit of the "UTF-8 as the in-memory representation" case as
having significant enough merit that proposals like this should
consider impact to both cases equally despite "UTF-8 as the in-memory
representation" case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

UTF-16 has some nice properties and there's no need to brand it a 
"mistake". UTF-8 has different nice properties, but there's equally no 
reason to treat it as more special than UTF-16.


The UTC should adopt a position of perfect neutrality with respect to the 
assumed in-memory representation; in other words, it should not assume 
that optimizing for any particular encoding form will benefit implementers.


UTC, where ICU is strongly represented, needs to guard against basing 
encoding, property, and algorithm decisions (mostly edge cases) solely or 
primarily on the needs of the particular implementation strategy that 
happens to have been chosen by the ICU project.


A./



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Shawn Steele via Unicode
>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.

IMO, the only thing that can be agreed upon is that "something's bad with this 
UTF-8 data".  I think that whether it's treated as a single group of corrupt 
bytes or each individual byte is considered a problem should be up to the 
implementation.

#1 - This data should "never happen".  In a system behaving normally, this 
condition should never be encountered.  
  * At this point the data is "bad" and all bets are off.
  * Some applications may have a clue how the bad data could have happened and 
want to do something in particular.
  * It seems odd to me to spend much effort standardizing a scenario that 
should be impossible.
#2 - Depending on implementation, either behavior, or some combination, may be 
more efficient.  I'd rather allow apps to optimize for the common case, not the 
case-that-shouldn't-ever-happen
#3 - We have no clue if this "maximal" sequence was a single error, 2 errors, 
or even more.  The lead byte says how many trail bytes should follow, and those 
should be in a certain range.  Values outside of those conditions are illegal, 
so we shouldn't ever encounter them.  So if we did, then something really weird 
happened.  
  * Did a single character get misencoded?
  * Was an illegal sequence illegally encoded?
  * Perhaps a byte got corrupted in transmission?
  * Maybe we dropped a packet/block, so this is really the beginning of a valid 
sequence and the tail of another completely valid sequence?

In practice, all that most apps would be able to do would be to say "You have 
bad data, how bad I have no clue, but it's not right".  A single bit could've 
flipped, or you could have only 3 pages of a 4000 page document.  No clue at 
all.  At that point it doesn't really matter how many FFFD's the error(s) are 
replaced with, and no assumptions should be made about the severity of the 
error.

-Shawn



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
 wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode  
> wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
> multiple errors there makes no sense.

The currently-specced behavior makes perfect sense when you add error
emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementation very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case in any way to
go away, I think the Unicode Consortium should recognize the serious
technical merit of the "UTF-8 as the in-memory representation" case as
having significant enough merit that proposals like this should
consider impact to both cases equally despite "UTF-8 as the in-memory
representation" case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Alastair Houghton via Unicode
On 15 May 2017, at 18:52, Asmus Freytag  wrote:
> 
> On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
>> On 15 May 2017, at 11:21, Henri Sivonen via Unicode  
>> wrote:
>>> In reference to:
>>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>> 
>>> I think Unicode should not adopt the proposed change.
>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.

Agreed.

> Apparently we have a situation where implementations disagree, and have done 
> so for a while. This normally means not only that the implementations differ, 
> but that data exists in both formats.
> 
> Even if it were true that all data is only stored in UTF-8, any data 
> converted from UTF-8 back to UTF-8 going through an interim stage that 
> requires UTF-8 conversion would then be different based on which converter is 
> used.
> 
> Implementations working in UTF-8 natively would potentially see three formats:
> 1) the original ill-formed data
> 2) data converted with single FFFD
> 3) data converted with multiple FFFD
> 
> These forms cannot be compared for equality by binary matching.

But that was already true back when you were under the impression that only 
one of (2) and (3) existed, and indeed claiming equality between two instances 
of U+FFFD might itself be problematic in some circumstances (you don’t know why 
the U+FFFDs were inserted - they may not replace the same original data).

> The best that can be done is to convert (1) into one of the other forms and 
> then compare treating any run of FFFD code points as equal to any other run, 
> irrespective of length.

It’s probably safer, actually, to refuse to compare U+FFFD as equal to anything 
(even itself) unless a special flag is passed.  For “general purpose” 
applications, you could set that flag and then a single U+FFFD would compare 
equal to another single U+FFFD; no need for the complicated “any string of 
U+FFFD” logic (which in any case makes little sense - it could just as easily 
generate erroneous comparisons as fix the case we’re worrying about here).

> Because we've had years of multiple implementations, it would be expected 
> that copious data exists in all three formats, and that data will not go 
> away. Changing the specification to pick one of these formats as solely 
> conformant is IMHO too late.

I don’t think so.  Even if we acknowledge the possibility of data in the other 
form, I think it’s useful guidance to implementers, both now and in the future. 
 One might even imagine that the other, non-favoured form, would eventually 
fall out of use.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode

On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote:

Second, the political reason:

Now that ICU is a Unicode Consortium project, I think the Unicode
Consortium should be particularly sensitive to biases arising from being
both the source of the spec and the source of a popular
implementation. It looks *really bad* both in terms of equal footing
of ICU vs. other implementations for the purpose of how the standard
is developed as well as the reliability of the standard text vs. ICU
source code as the source of truth that other implementors need to pay
attention to if the way the Unicode Consortium resolves a discrepancy
between ICU behavior and a well-known spec provision (this isn't some
ill-known corner case, after all) is by changing the spec instead of
changing ICU *especially* when the change is not neutral for
implementations that have made different but completely valid per
then-existing spec and, in the absence of legacy constraints, superior
architectural choices compared to ICU (i.e. UTF-8 internally instead
of UTF-16 internally).

I can see the irony of this viewpoint coming from a WHATWG-aligned
browser developer, but I note that even browsers that use ICU for
legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
isn't, in fact, the dominant browser UTF-8 behavior. That is, even
Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
environment that's the most sensitive to how issues like this are
handled, so it would be appropriate for the proposal to survey current
browser behavior instead of just saying that ICU "feels right" or is
"natural".


I think this political reason should be taken very seriously. There are 
already too many instances where ICU can be seen "driving" the 
development of properties and algorithms.


Those involved in the ICU project may not see the problem, but I agree 
with Henri that it requires a bit more sensitivity from the UTC.


A./



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode

On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:

On 15 May 2017, at 11:21, Henri Sivonen via Unicode  wrote:

In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
multiple errors there makes no sense.


Changing a specification as fundamental as this is something that should 
not be undertaken lightly.


Apparently we have a situation where implementations disagree, and have 
done so for a while. This normally means not only that the 
implementations differ, but that data exists in both formats.


Even if it were true that all data is only stored in UTF-8, any data 
converted from UTF-8 back to UTF-8 going through an interim stage that 
requires UTF-8 conversion would then be different based on which 
converter is used.


Implementations working in UTF-8 natively would potentially see three 
formats:

1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD

These forms cannot be compared for equality by binary matching.

The best that can be done is to convert (1) into one of the other forms 
and then compare treating any run of FFFD code points as equal to any 
other run, irrespective of length.
(For security-critical applications, the presence of any FFFD should 
render the data invalid, so the comparisons we'd be talking about here 
would be for general purpose, like search).
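
A sketch of such a comparison (for general-purpose matching only; as noted
above, security-sensitive code should instead treat any U+FFFD as
disqualifying):

    import re

    _FFFD_RUN = re.compile("\uFFFD+")

    def loosely_equal(a: str, b: str) -> bool:
        """Treat any run of U+FFFD as equivalent to any other run,
        irrespective of length, and compare everything else exactly."""
        return _FFFD_RUN.sub("\uFFFD", a) == _FFFD_RUN.sub("\uFFFD", b)

    print(loosely_equal("ab\uFFFDcd", "ab\uFFFD\uFFFD\uFFFDcd"))  # True
    print(loosely_equal("ab\uFFFDcd", "abXcd"))                   # False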


Because we've had years of multiple implementations, it would be 
expected that copious data exists in all three formats, and that data 
will not go away. Changing the specification to pick one of these 
formats as solely conformant is IMHO too late.


A./





ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes concerns of such implementation very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.  The fact is that UTF-16 
makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient 
for primarily ASCII text, but that is not the case for other situations and the 
fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 
usually focus on) is no more complicated than handling combining characters, 
which you have to do anyway.


Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use 
UTF-8 internally, for sure, but I don’t see what that has to do with either the 
original proposal or with your criticism of UTF-16.

[snip]


If the proposed
change was adopted, while Draconian decoders (that fail upon first
error) could retain their current state machine, implementations that
emit U+FFFD for errors and continue would have to add more state
machine states (i.e. more complexity) to consolidate more input bytes
into a single U+FFFD even after a valid sequence is obviously
impossible.

“Impossible”?  Why?  You just need to add some error states (or *an* error 
state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the 
only library that already did just that *because it’s clearly the right thing 
to do*.

Kind regards,

Alastair.

--
http://alastairs-place.net







Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Alastair Houghton via Unicode
On 15 May 2017, at 11:21, Henri Sivonen via Unicode  wrote:
> 
> In reference to:
> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
> 
> I think Unicode should not adopt the proposed change.

Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
multiple errors there makes no sense.

> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
> 
> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes concerns of such implementation very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.  The fact is that UTF-16 
makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient 
for primarily ASCII text, but that is not the case for other situations and the 
fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 
usually focus on) is no more complicated than handling combining characters, 
which you have to do anyway.

> Therefore, despite UTF-16 being widely used as an in-memory
> representation of Unicode and in no way going away, I think the
> Unicode Consortium should be *very* sympathetic to technical
> considerations for implementations that use UTF-8 as the in-memory
> representation of Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use 
UTF-8 internally, for sure, but I don’t see what that has to do with either the 
original proposal or with your criticism of UTF-16.

[snip]

> If the proposed
> change was adopted, while Draconian decoders (that fail upon first
> error) could retain their current state machine, implementations that
> emit U+FFFD for errors and continue would have to add more state
> machine states (i.e. more complexity) to consolidate more input bytes
> into a single U+FFFD even after a valid sequence is obviously
> impossible.

“Impossible”?  Why?  You just need to add some error states (or *an* error 
state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the 
only library that already did just that *because it’s clearly the right thing 
to do*.
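
For what it's worth, a rough sketch of that approach in Python (one U+FFFD 
per structurally delimited sequence, with the trail-byte count taken from 
the lead byte alone; this is my reading of the behaviour being argued for, 
not a transcription of ICU's exact rules):

    def decode_single_fffd_per_sequence(data: bytes) -> str:
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b <= 0x7F:
                out.append(chr(b)); i += 1; continue
            # Trail-byte count implied by the lead byte's bit pattern alone
            # (the "original structural definition" of UTF-8).
            if   0xC0 <= b <= 0xDF: need = 1
            elif 0xE0 <= b <= 0xEF: need = 2
            elif 0xF0 <= b <= 0xF7: need = 3
            else:
                out.append("\uFFFD"); i += 1; continue  # stray trail byte or F8..FF
            j = i + 1
            while j < n and j < i + 1 + need and 0x80 <= data[j] <= 0xBF:
                j += 1                                  # the counter mentioned above
            try:
                out.append(data[i:j].decode("utf-8"))   # well-formed: keep the character
            except UnicodeDecodeError:
                out.append("\uFFFD")                    # overlong, surrogate, out-of-range
                                                        # or truncated: a single U+FFFD
            i = j
        return "".join(out)

    print(len(decode_single_fffd_per_sequence(b"\xc0\xaf")))      # 1: an over-long pair is one error
    print(len(decode_single_fffd_per_sequence(b"\xf0\x80\x80")))  # 1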

Kind regards,

Alastair.

--
http://alastairs-place.net




Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.

First, the technical reason:

ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes concerns of such implementation very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake (since Unicode grew past 16 bits
anyway making UTF-16 both variable-width *and*
ASCII-incompatible--i.e. widening the code units to be
ASCII-incompatible didn't buy a constant-width encoding after all) and
that when the legacy constraints of Win32, Java, C#, JavaScript, ICU,
etc. don't force UTF-16 as the internal Unicode representation, using
UTF-8 as the internal Unicode representation is the technically
superior design: Using UTF-8 as the internal Unicode representation is
memory-efficient and cache-efficient when dealing with data formats
whose syntax is mostly ASCII (e.g. HTML), forces developers to handle
variable-width issues right away, makes input decode a matter of mere
validation without copy when the input is conforming and makes output
encode infinitely fast (no encode step needed).

Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.

When looking this issue from the ICU perspective of using UTF-16 as
the in-memory representation of Unicode, it's easy to consider the
proposed change as the easier thing for implementation (after all, no
change for the ICU implementation is involved!). However, when UTF-8
is the in-memory representation of Unicode and "decoding" UTF-8 input
is a matter of *validating* UTF-8, a state machine that rejects a
sequence as soon as it's impossible for the sequence to be valid UTF-8
(under the definition that excludes surrogate code points and code
points beyond U+10FFFF) makes a whole lot of sense. If the proposed
change was adopted, while Draconian decoders (that fail upon first
error) could retain their current state machine, implementations that
emit U+FFFD for errors and continue would have to add more state
machine states (i.e. more complexity) to consolidate more input bytes
into a single U+FFFD even after a valid sequence is obviously
impossible.

When the decision can easily go either way for implementations that
use UTF-16 internally but the options are not equal when using UTF-8
internally, the "UTF-8 internally" case should be decisive.
(Especially when spec-wise that decision involves no change. I further
note the proposal PDF argues on the level of "feels right" without
even discussing the impact on implementations that use UTF-8
internally.)

As a matter of implementation experience, the implementation I've
written (https://github.com/hsivonen/encoding_rs) supports both the
UTF-16 as the in-memory Unicode representation and the UTF-8 as the
in-memory Unicode representation scenarios, and the fail-fast
requirement wasn't onerous in the UTF-16 as the in-memory
representation scenario.

Second, the political reason:

Now that ICU is a Unicode Consortium project, I think the Unicode
Consortium should be particularly sensitive to biases arising from being
both the source of the spec and the source of a popular
implementation. It looks *really bad* both in terms of equal footing
of ICU vs. other implementations for the purpose of how the standard
is developed as well as the reliability of the standard text vs. ICU
source code as the source of truth that other implementors need to pay
attention to if the way the Unicode Consortium resolves a discrepancy
between ICU behavior and a well-known spec provision (this isn't some
ill-known corner case, after all) is by changing the spec instead of
changing ICU *especially* when the change is not neutral for
implementations that have made different but completely valid per
then-existing spec and, in the absence of legacy constraints, superior
architectural choices compared to ICU (i.e. UTF-8 internally instead
of UTF-16 internally).

I can see the irony of this viewpoint coming from a WHATWG-aligned
browser developer, but I note that even browsers that use ICU for
legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
isn't, in fact, the dominant browser UTF-8 behavior. That is, even
Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
environment that's the most sensitive to how issues like this are
handled, so it would be appropriate for the proposal to survey current
browser behavior instead of just saying that ICU "feels right" or is
"natural".

<    1   2