Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
Another alternative for your API is to not return simple integer values, but
to return (read-only) instances of a Char32 class whose "scalar" property
would normally be a valid code point with a scalar value, and whose "string"
property would be the actual character, with another property
"isValidScalar" returning "true"; for ill-formed
sequences, "isValidScalar" will be false, the scalar value will be the
initial code unit from the input (decoded from the internal representation
in the backing store) and the "string" property will be empty. You may
also add a special "Char32" static instance representing
end-of-file/end-of-string, whose "isEOF" property will be true, whose
scalar will typically be -1, whose "isValidScalar" will be false, and
whose "string" property will be the empty string.

All this is possible independently of the internal representation used in
the backing store for its own code units (which may use any extension of
the standard UTFs or any data compression scheme without exposing it).
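
A minimal sketch of such a value type, reusing the property names above
(illustrative only, not an existing API):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Char32:
        """Read-only result of reading one position from a decoder."""
        scalar: int            # valid code point, raw code unit, or -1 at EOF
        is_valid_scalar: bool
        is_eof: bool = False

        @property
        def string(self) -> str:
            # The decoded character, or "" for ill-formed input and for EOF.
            return chr(self.scalar) if self.is_valid_scalar else ""

    # Shared sentinel instance for end-of-file/end-of-string.
    EOF = Char32(scalar=-1, is_valid_scalar=False, is_eof=True)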

2017-05-16 23:08 GMT+02:00 Philippe Verdy :

>
>
> 2017-05-16 20:50 GMT+02:00 Shawn Steele :
>
>> But why change a recommendation just because it “feels like”.  As you
>> said, it’s just a recommendation, so if that really annoyed someone, they
>> could do something else (eg: they could use a single FFFD).
>>
>>
>>
>> If the recommendation is truly that meaningless or arbitrary, then we
>> just get into silly discussions of “better” that nobody can really answer.
>>
>>
>>
>> Alternatively, how about “one or more FFFDs?” for the recommendation?
>>
>>
>>
>> To me it feels very odd to perhaps require writing extra code to detect
>> an illegal case.  The “best practice” here should maybe be “one or more
>> FFFDs, whatever makes your code faster”.
>>
>
> Faster is fine, provided this does not break other uses, notably random
> access within strings, where UTF-8 is designed to allow searching backward
> over a limited number of bytes (at most 3) in order to find the leading
> byte, and then checking its value:
> - if no leading byte is found, return to the initial position and make the
> next access return U+FFFD to signal the positioning error: this trailing
> byte is part of an ill-formed sequence, and for coherence any further
> trailing bytes found after it will **also** return U+FFFD (because those
> other trailing bytes may likewise be reached by random access).
> - if the leading byte is found backward but does not match the expected
> number of trailing bytes after it, return to the initial random position
> and return U+FFFD there as well. This means that the initial leading byte
> (part of the ill-formed sequence) must also return a separate U+FFFD, given
> that each following trailing byte will return U+FFFD in isolation when
> accessed.
>
> If we want decoding to stay coherent with text-handling primitives that
> allow random access into encoded sequences, there is no other choice than
> treating EACH byte of the ill-formed sequence as an individual error mapped
> to the same replacement code point (U+FFFD if that is what is chosen, but
> these APIs could as well specify another replacement character, or could
> even return a non-code-point if the API's return value is not restricted to
> valid code points: for example, the replacement could be a negative value
> whose absolute value matches the invalid code unit, or some other value
> outside the valid range of code points with scalar values; isolated
> surrogates in UTF-16, for example, could be returned as is, or made
> negative either by returning their opposite or by setting (or'ing) the most
> significant bit of the return value).
>
> The problem arises when you need to store the replacement values and the
> internal backing store is limited to 16-bit or 8-bit code units: this
> backing store may use its own internal extension of the standard UTFs,
> including the possibility of encoding NULs as C0 80 (like Java's "modified
> UTF-8" internal encoding used in its compiled binary classes and
> serializations), or internally using isolated trailing surrogates to store
> ill-formed UTF-8 input by or'ing those bytes with 0xDC00, which will be
> returned as code points with no valid scalar value. For internally
> representing ill-formed UTF-16 sequences, there's no need to change
> anything. For internally representing ill-formed UTF-32 sequences (in fact
> limited to one 32-bit code unit), a 16-bit internal backing store may need
> to store three 16-bit values (three isolated trailing surrogates). For
> internally representing ill-formed UTF-32 in an 8-bit backing store, you
> could use 0xC1 followed by five trailing bytes (each one storing 7 bits of
> the initial ill-formed code unit from the UTF-32 input).
>
> What you'll do in the internal backing store will not be exposed by your
> API, which will just return either valid 

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Faster is fine, provided this does not break other uses, notably random 
> access within strings…

Either way, this is a “recommendation”.  I don’t see how that can provide for 
not-“breaking other uses.”  If it’s internal, you can do what you will, so if 
you need the 1:1 seeming parity, then you can do that internally.  But if 
you’re depending on other APIs/libraries/data source/whatever, it would seem 
like you couldn’t count on that.  (And probably shouldn’t even if it was a 
requirement rather than a recommendation).

I’m wary of the idea of attempting random access on a stream that is also 
manipulating the stream at the same time (decoding apparently).

The U+FFFD emitted by this decoding could also require a different # of bytes 
to reencode.  Which might disrupt the presumed parity, depending on how the 
data access was being handled.
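
Concretely, a single replaced byte re-encodes to three bytes, so byte-for-byte
parity with the original stream is lost either way (quick check, not a claim
about any particular implementation):

    assert len(b"\x80".decode("utf-8", errors="replace")) == 1   # one bad byte -> one U+FFFD
    assert len("\ufffd".encode("utf-8")) == 3                    # which re-encodes as EF BF BD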

-Shawn


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 20:50 GMT+02:00 Shawn Steele :

> But why change a recommendation just because it “feels like”.  As you
> said, it’s just a recommendation, so if that really annoyed someone, they
> could do something else (eg: they could use a single FFFD).
>
>
>
> If the recommendation is truly that meaningless or arbitrary, then we just
> get into silly discussions of “better” that nobody can really answer.
>
>
>
> Alternatively, how about “one or more FFFDs?” for the recommendation?
>
>
>
> To me it feels very odd to perhaps require writing extra code to detect an
> illegal case.  The “best practice” here should maybe be “one or more FFFDs,
> whatever makes your code faster”.
>

Faster is fine, provided this does not break other uses, notably random
access within strings, where UTF-8 is designed to allow searching backward
over a limited number of bytes (at most 3) in order to find the leading
byte, and then checking its value:
- if no leading byte is found, return to the initial position and make the
next access return U+FFFD to signal the positioning error: this trailing
byte is part of an ill-formed sequence, and for coherence any further
trailing bytes found after it will **also** return U+FFFD (because those
other trailing bytes may likewise be reached by random access).
- if the leading byte is found backward but does not match the expected
number of trailing bytes after it, return to the initial random position
and return U+FFFD there as well. This means that the initial leading byte
(part of the ill-formed sequence) must also return a separate U+FFFD, given
that each following trailing byte will return U+FFFD in isolation when
accessed.
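
A rough sketch of that backward scan, under a per-byte U+FFFD policy
(illustrative only; it leans on Python's strict decoder to validate the
candidate sequence):

    def _seq_len(lead: int) -> int:
        # Expected total length of a UTF-8 sequence for a given lead byte,
        # or 0 if the byte cannot start a sequence.
        if lead < 0x80: return 1
        if 0xC2 <= lead <= 0xDF: return 2
        if 0xE0 <= lead <= 0xEF: return 3
        if 0xF0 <= lead <= 0xF4: return 4
        return 0

    def code_point_at(buf: bytes, i: int) -> int:
        # Code point covering byte offset i, or U+FFFD if i falls inside an
        # ill-formed sequence (each such byte reports its own U+FFFD).
        start = i
        while start > 0 and i - start < 3 and (buf[start] & 0xC0) == 0x80:
            start -= 1                 # scan back at most 3 bytes for a lead
        n = _seq_len(buf[start])
        if n == 0 or start + n <= i or start + n > len(buf):
            return 0xFFFD              # no lead found, or it does not cover i
        try:
            return ord(buf[start:start + n].decode("utf-8"))
        except UnicodeDecodeError:
            return 0xFFFD              # overlong, surrogate, or bad trail byte

For any well-formed sequence this returns its scalar value from every one of
its byte offsets; for ill-formed bytes each offset independently reports
U+FFFD, which is the coherence property described above.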

If we want decoding to stay coherent with text-handling primitives that
allow random access into encoded sequences, there is no other choice than
treating EACH byte of the ill-formed sequence as an individual error mapped
to the same replacement code point (U+FFFD if that is what is chosen, but
these APIs could as well specify another replacement character, or could
even return a non-code-point if the API's return value is not restricted to
valid code points: for example, the replacement could be a negative value
whose absolute value matches the invalid code unit, or some other value
outside the valid range of code points with scalar values; isolated
surrogates in UTF-16, for example, could be returned as is, or made
negative either by returning their opposite or by setting (or'ing) the most
significant bit of the return value).

The problem arises when you need to store the replacement values and the
internal backing store is limited to 16-bit or 8-bit code units: this
backing store may use its own internal extension of the standard UTFs,
including the possibility of encoding NULs as C0 80 (like Java's "modified
UTF-8" internal encoding used in its compiled binary classes and
serializations), or internally using isolated trailing surrogates to store
ill-formed UTF-8 input by or'ing those bytes with 0xDC00, which will be
returned as code points with no valid scalar value. For internally
representing ill-formed UTF-16 sequences, there's no need to change
anything. For internally representing ill-formed UTF-32 sequences (in fact
limited to one 32-bit code unit), a 16-bit internal backing store may need
to store three 16-bit values (three isolated trailing surrogates). For
internally representing ill-formed UTF-32 in an 8-bit backing store, you
could use 0xC1 followed by five trailing bytes (each one storing 7 bits of
the initial ill-formed code unit from the UTF-32 input).
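
Python's "surrogateescape" error handler (PEP 383) implements exactly this
0xDC00 trick for 8-bit input, so the round trip is easy to demonstrate:

    raw = b"ok \xe0\x80\x80 caf\xe9"                 # ill-formed bytes mixed with ASCII
    text = raw.decode("utf-8", errors="surrogateescape")
    # text == 'ok \udce0\udc80\udc80 caf\udce9': each bad byte became 0xDC00 | byte,
    # a lone trailing surrogate with no valid scalar value.
    assert text.encode("utf-8", errors="surrogateescape") == raw  # lossless round trip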

What you do in the internal backing store will not be exposed by your API,
which will just return either valid code points with valid scalar values,
or values outside the two valid subranges (so possibly negative values, or
isolated trailing surrogates). That backing store can also substitute some
valid but problematic input (such as NULs) using 0xC0 plus another byte;
that sequence is not exposed by your API, which will still be able to
return the expected code points (with the minor caveat that the total
number of returned code points will not match the actual size allocated for
the internal backing store, which applications using the API won't even
need to know about).

In other words: any private extension is possible internally, as long as it
is isolated behind a black-box API, which is still free to choose how to
represent the input text (it may as well use a zlib-compressed backing
store, or some stateless Huffman compression based on a static statistics
table configured and stored elsewhere, initialized when you first
instantiate your API).


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 11:36:39 -0700
Markus Scherer via Unicode  wrote:

> Why do we care how we carve up an illegal sequence into subsequences?
> Only for debugging and visual inspection. Maybe some process is using
> illegal, overlong sequences to encode something special (à la Java
> string serialization, "modified UTF-8"), and for that it might be
> convenient too to treat overlong sequences as single errors.

I think that's not quite true.  If we are moving back and forth through
a buffer containing corrupt text, we need to make sure that moving three
characters forward and then three characters back leaves us where we
started.  That requires internal consistency.
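
One way to make that concrete: under a per-byte replacement policy, forward
and backward iteration find the same boundaries even across corrupt bytes.
A self-contained sketch (it uses Python's strict decoder as the
well-formedness oracle; the buffer contents are made up for illustration):

    def _wf_len(buf: bytes, i: int) -> int:
        # Length of the well-formed UTF-8 sequence starting at offset i, else 0.
        for n in (1, 2, 3, 4):
            chunk = buf[i:i + n]
            if len(chunk) == n:
                try:
                    chunk.decode("utf-8")
                    return n
                except UnicodeDecodeError:
                    pass
        return 0

    def next_boundary(buf: bytes, i: int) -> int:
        # Step forward one code point; each ill-formed byte counts as one U+FFFD.
        return i + (_wf_len(buf, i) or 1)

    def prev_boundary(buf: bytes, j: int) -> int:
        # Step backward one code point, resynchronizing on a lead byte if possible.
        for start in range(max(0, j - 4), j):
            if _wf_len(buf, start) == j - start:
                return start
        return j - 1

    buf = b"A\xe0\x80\x80\xc3\xa9"     # 'A', an ill-formed E0 80 80 run, then 'é'
    fwd = [0]
    while fwd[-1] < len(buf):
        fwd.append(next_boundary(buf, fwd[-1]))
    bwd = [len(buf)]
    while bwd[-1] > 0:
        bwd.append(prev_boundary(buf, bwd[-1]))
    assert fwd == bwd[::-1]            # same boundaries in both directions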

One possible issue is with text input methods that access an
application's backing store.  They can issue updates in the form of
'delete 3 characters and insert ...'.  However, if the input method is
accessing characters it hasn't written, it's probably misbehaving
anyway.  Such commands do rather heavily assume that any
relevant normalisation by the application will be taken into account by
the input method.  I once had a go at fixing an application that was
misinterpreting 'delete x characters' as 'delete x UTF-16 code units'.
It was a horrible mess, as the application's interface layer couldn't
peek at the string being edited.

Richard.



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
But why change a recommendation just because it “feels like”.  As you said, 
it’s just a recommendation, so if that really annoyed someone, they could do 
something else (eg: they could use a single FFFD).

If the recommendation is truly that meaningless or arbitrary, then we just get 
into silly discussions of “better” that nobody can really answer.

Alternatively, how about “one or more FFFDs?” for the recommendation?

To me it feels very odd to perhaps require writing extra code to detect an 
illegal case.  The “best practice” here should maybe be “one or more FFFDs, 
whatever makes your code faster”.

Best practices may not be requirements, but people will still take time to file 
bugs that something isn’t following a “best practice”.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer 
via Unicode
Sent: Tuesday, May 16, 2017 11:37 AM
To: Alastair Houghton 
Cc: Philippe Verdy ; Henri Sivonen ; 
unicode Unicode Discussion ; Hans Åberg 

Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies 
to finding and interpreting valid sequences properly. This includes not 
consuming parts of valid sequences when dealing with illegal ones, as explained 
in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think 
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU 
team. At the time, I believe the ISO UTF-8 definition was not yet limited to 
U+10FFFF, and decoding overlong sequences and those yielding surrogate code 
points was regarded as a misdemeanor. The spec has been tightened up, but I am 
pretty sure that most people familiar with how UTF-8 came about would recognize 
 and  as single sequences.

I believe that the discussion of how to handle illegal sequences came out of 
security issues a few years ago from some implementations including valid 
single and lead bytes with preceding illegal sequences. Beyond the "Constraints 
on Conversion Processes", there was evidently also a desire to recommend how to 
handle illegal sequences.

I think that the current recommendation was an extrapolation of common practice 
for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, 
but "it feels like" (yes, that's the level of argument for stuff that doesn't 
really matter) not treating  and  as single sequences is 
"weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for 
debugging and visual inspection. Maybe some process is using illegal, overlong 
sequences to encode something special (à la Java string serialization, 
"modified UTF-8"), and for that it might be convenient too to treat overlong 
sequences as single errors.

If you don't like some recommendation, then do something else. It does not 
matter. If you don't reject the whole input but instead choose to replace 
illegal sequences with something, then make sure the something is not nothing 
-- replacing with an empty string can cause security issues. Otherwise, what 
the something is, or how many of them you put in, is not very relevant. One or 
more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but didn't 
like the edge cases. At the time, I didn't think it was important to twiddle 
with the text in the standard, and I didn't care that ICU didn't exactly 
implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with 
a space, because it's easier than writing a U+FFFD for each byte or for some 
subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long 
illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best 
practices are wrong." I think "wrong" is far too strong, but I got an action 
item to propose a change in the text. I proposed a modified recommendation. 
Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" 
that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a wider 
set of sequences, but a capable implementer will optimize successfully for 
valid sequences, and maybe even for a subset of those for what might be 
expected high-frequency code point ranges. Error handling can go into a slow 
path. In a true state table implementation, it will require more states but 
should not affect the performance of valid sequences.

Many years ago, I decided for ICU to add a small amount of slow-path 
error-handling code for more 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 19:36, Markus Scherer  wrote:
> 
> Let me try to address some of the issues raised here.

Thanks for jumping in.

The one thing I wanted to ask about was the “without ever restricting trail 
bytes to less than 80..BF”.  I think that could be misinterpreted; having 
thought about it some more, I think you mean “considering any trailing byte in 
the range 80..BF as valid”.  The “less than” threw me the first few times I 
read it and I started thinking you meant allowing any byte as a trailing byte, 
which is clearly not right.

Otherwise, I’m happy :-)

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode
Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance
applies to finding and interpreting valid sequences properly. This includes
not consuming parts of valid sequences when dealing with illegal ones, as
explained in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the
ICU team. At the time, I believe the ISO UTF-8 definition was not yet
limited to U+10FFFF, and decoding overlong sequences and those yielding
surrogate code points was regarded as a misdemeanor. The spec has been
tightened up, but I am pretty sure that most people familiar with how UTF-8
came about would recognize  and  as single sequences.

I believe that the discussion of how to handle illegal sequences came out
of security issues a few years ago from some implementations including
valid single and lead bytes with preceding illegal sequences. Beyond the
"Constraints on Conversion Processes", there was evidently also a desire to
recommend how to handle illegal sequences.

I think that the current recommendation was an extrapolation of common
practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for
UTF-8, too, but "it feels like" (yes, that's the level of argument for
stuff that doesn't really matter) not treating  and  as
single sequences is "weird".

Why do we care how we carve up an illegal sequence into subsequences? Only
for debugging and visual inspection. Maybe some process is using illegal,
overlong sequences to encode something special (à la Java string
serialization, "modified UTF-8"), and for that it might be convenient too
to treat overlong sequences as single errors.

If you don't like some recommendation, then do something else. It does not
matter. If you don't reject the whole input but instead choose to replace
illegal sequences with something, then make sure the something is not
nothing -- replacing with an empty string can cause security issues.
Otherwise, what the something is, or how many of them you put in, is not
very relevant. One or more U+FFFDs is customary.
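
For instance, with the surrogate case that keeps coming up in this thread (a
sketch; Python 3 is among the implementations that follow the current
recommendation):

    data = b"A\xed\xa0\x80B"   # ED A0 80 is the CESU-8-style encoding of a lone surrogate
    data.decode("utf-8", errors="replace")
    # -> 'A\ufffd\ufffd\ufffdB' under the current recommendation (three maximal
    #    subsequences: ED, A0, 80).  A decoder following the proposed change
    #    would treat ED A0 80 as one subsequence and produce 'A\ufffdB'.
    # Either way it is "one or more U+FFFDs", and both decoders interpret
    # the surrounding valid sequences identically.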

When the current recommendation came in, I thought it was reasonable but
didn't like the edge cases. At the time, I didn't think it was important to
twiddle with the text in the standard, and I didn't care that ICU didn't
exactly implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence
with a space, because it's easier than writing a U+FFFD for each byte or
for some subsequences. Fine. Someone might write a single U+FFFD for an
arbitrarily long illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best
practices are wrong." I think "wrong" is far too strong, but I got an
action item to propose a change in the text. I proposed a modified
recommendation. Nothing gets elevated to "right" that wasn't, nothing gets
demoted to "wrong" that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a
wider set of sequences, but a capable implementer will optimize
successfully for valid sequences, and maybe even for a subset of those for
what might be expected high-frequency code point ranges. Error handling can
go into a slow path. In a true state table implementation, it will require
more states but should not affect the performance of valid sequences.
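
A sketch of the fast-path/slow-path split (illustrative only; it reuses
Python's strict decoder for the fast path, and the error chunking is simply
whatever that decoder reports):

    def decode_lenient(data: bytes) -> str:
        # Fast path: strict decode, so valid input never pays for error handling.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            pass
        # Slow path: decode valid runs, substituting one U+FFFD per ill-formed
        # subsequence reported by the strict decoder.
        out, pos = [], 0
        while pos < len(data):
            try:
                out.append(data[pos:].decode("utf-8"))
                break
            except UnicodeDecodeError as err:
                out.append(data[pos:pos + err.start].decode("utf-8"))
                out.append("\ufffd")
                pos += err.end
        return "".join(out)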

Many years ago, I decided for ICU to add a small amount of slow-path
error-handling code for more human-friendly illegal-sequence reporting. In
other words, this was not done out of convenience; it was an inconvenience
that seemed justified by nicer error reporting. If you don't like to do so,
then don't.

Which UTF is better? It depends. They all have advantages and problems.
It's all Unicode, so it's all good.

ICU largely uses UTF-16 but also UTF-8. It has data structures and code for
charset conversion, property lookup, sets of characters (UnicodeSet), and
collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly
growing set of APIs working directly with UTF-8.

So, please take a deep breath. No conformance requirement is being touched,
no one is forced to do something they don't like, no special consideration
is given for one UTF over another.

Best regards,
markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 20:01, Philippe Verdy  wrote:
> 
> On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random 
> sequences of 16-bit code units are not permitted. There's visibly a 
> validation step that returns an error if you attempt to create files with 
> invalid sequences (including other restrictions such as forbidding U+0000 and 
> some other problematic controls).

For it to work the way I suggested, there would be low level routines that 
handle the names raw, and then on top of that, interface routines doing what 
you describe. On the Austin Group List, they mentioned a filesystem doing it 
directly in UTF-16, and it could have been the one you describe.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode :

> C) The data was corrupted by some other means.  Perhaps bad
> concatenations, lost blocks during read/transmission, etc.  If we lost two
> 512-byte blocks, then maybe we should have a thousand FFFDs (but how would
> we know?)
>

Thousands of U+FFFDs are not a problem (independently of the internal UTF
encoding used): yes, the two 512-byte blocks could then become 3 times
larger (if using the UTF-8 internal encoding) or 2 times larger (if using
the UTF-16 internal encoding), but every application should be prepared to
support that size expansion, which has a completely known maximum factor
and could occur as well with any valid CJK-only text.
So the size to allocate for the internal storage is predictable from the
size of the input; this is an important feature of all standard UTFs.
Being able to handle the worst case of allowed expansion militates largely
for the adoption of UTF-16 as the internal encoding, instead of UTF-8
(where you'll need to allocate more space before decoding the input if you
want to avoid successive memory reallocations, which would impact the
performance of your decoder): it's simple to accept input from 512-byte
(or 1 KB) buffers and to allocate a 1 KB (or 2 KB) buffer for storing the
intermediate results in the generic decoder, and simpler on the outer level
to preallocate buffers with reasonable sizes that will be reallocated once
if needed to the maximum size, and then reduced to the effective size (if
needed) at the end of successful decoding (some implementations can use
pools of preallocated buffers with small static sizes, allocating new
buffers outside the pool only for the rare cases where more space is
needed).
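
The bound is easy to check with a decoder that follows the current
recommendation (one U+FFFD per stray byte); a quick arithmetic sketch:

    bad = bytes([0x80]) * 512                        # 512 stray continuation bytes
    text = bad.decode("utf-8", errors="replace")
    assert len(text) == 512                          # one U+FFFD per input byte
    assert len(text.encode("utf-16-le")) == 1024     # 2x if stored as UTF-16
    assert len(text.encode("utf-8")) == 1536         # 3x if re-stored as UTF-8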


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode

  
  
On 5/16/2017 10:30 AM, Shawn Steele via Unicode wrote:

> > Would you advocate replacing
> >
> >   e0 80 80
> >
> > with
> >
> >   U+FFFD U+FFFD U+FFFD (1)
> >
> > rather than
> >
> >   U+FFFD (2)
> >
> > It’s pretty clear what the intent of the encoder was there, I’d say, and
> > while we certainly don’t want to decode it as a NUL (that was the source
> > of previous security bugs, as I recall), I also don’t see the logic in
> > insisting that it must be decoded to *three* code points when it clearly
> > only represented one in the input.
>
> It is not at all clear what the intent of the encoder was - or even if it's
> not just a problem with the data stream.  E0 80 80 is not permitted, it's
> garbage.  An encoder can't "intend" it.
>
> Either
> A) the "encoder" was attempting to be malicious, in which case the whole
> thing is suspect and garbage, and so the # of FFFD's doesn't matter, or
>
> B) the "encoder" is completely broken, in which case all bets are off,
> again, specifying the # of FFFD's is irrelevant.
>
> C) The data was corrupted by some other means.  Perhaps bad concatenations,
> lost blocks during read/transmission, etc.  If we lost two 512-byte blocks,
> then maybe we should have a thousand FFFDs (but how would we know?)
>
> -Shawn

Clearly, for the receiver, nothing reliable can be deduced about the raw
byte stream once an FFFD has been inserted.

For the receiver, there's a fourth case that might have been:

D) the raw UTF-8 stream contained a valid U+FFFD



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
Regardless, it's not legal and hasn't been legal for quite some time.  
Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to 
anything depending on that fake-null, so one or three isn't really going to 
matter.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +
Shawn Steele via Unicode  wrote:

> > Would you advocate replacing
> 
> >   e0 80 80
> 
> > with
> 
> >   U+FFFD U+FFFD U+FFFD (1)  
> 
> > rather than
> 
> >   U+FFFD   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d say, 
> > and while we certainly don’t want to decode it as a NUL (that was 
> > the source of previous security bugs, as I recall), I also don’t see 
> > the logic in insisting that it must be decoded to *three* code 
> > points when it clearly only represented one in the input.
> 
> It is not at all clear what the intent of the encoder was - or even if 
> it's not just a problem with the data stream.  E0 80 80 is not 
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is still in 
use, and seems to be the best way of storing NUL as character content in a *C 
string*.  (Strictly speaking, one can't do it.)  It could be lurking in old 
text or come from an old program that somehow doesn't get used for U+0080 to 
U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of 
converting UTF-16 to UTF-8.

Remember the conformance test for the Unicode Collation Algorithm has contained 
lone surrogates in the past, and the UAX on Unicode Regular Expressions used to 
require the ability to search for lone surrogates.

Richard.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random
sequences of 16-bit code units are not permitted. There's visibly a
validation step that returns an error if you attempt to create files with
invalid sequences (including other restrictions such as forbidding U+0000
and some other problematic controls).

This occurs because the NTFS and FAT drivers will also attempt to normalize
the string in order to create compatibility 8.3 filenames using the
system's native locale (not the current user locale, which is used when
searching files, enumerating directories or opening files - this could
generate errors when the encodings for distinct locales do not match, but
should not cause errors when filenames are **first** searched in the UTF-16
encoding specified by applications; applications that still need to access
files by their short name are deprecated anyway). The normalization used
for creating short 8.3 filenames relies on OS-specific conversion tables
built into the filesystem drivers. This generation, however, has a cost due
to the uniqueness constraints (requiring the first part of the 8.3 name to
be abbreviated so that "~number" suffixes can be added before the
extension, whose value is unpredictable if other "*~1.*" files already
exist: the driver has to retry with another number, looping if necessary).
This also has a (very modest) storage cost, but it is less critical than
the enumeration step and the fact that these shortened names cannot be
predicted by applications.

This canonicalization is also required because the filesystem is
case-insensitive (and it's technically not possible to store all the
multiple case variants of filenames as assigned aliases/physical links).
In classic filesystems for Unix/Linux, the only restrictions are the
forbidding of null bytes and the role assigned to "/" in hierarchical
filesystems (unusable anywhere in a directory entry name), plus the
reservation of the "." and ".." entries in directories, meaning that only
8-bit encodings based on 7-bit ASCII are possible; so Linux/Unix are not
completely treating these filenames as pure binary bags of bytes (however,
this is not checked, so such random names may occur, and they will be
difficult to handle with classic tools and shells). Some other filesystems
for Linux/Unix still enforce restrictions (and there even exist versions of
them that support case insensitivity, in addition to the
FAT12/FAT16/FAT32/exFAT/NTFS emulated filesystems: this also exists in the
NFS driver as an option, in drivers for legacy filesystems initially coming
from mainframes, in filesystem drivers based on FTP, and even in the
filesystem driver allowing a Windows registry, which is also
case-insensitive, to be mounted).

Technically, in the core kernel of Linux/Unix there's no restriction on the
effective encoding (except "/" and null); the actual restrictions are
implemented within filesystem drivers, configured only when volumes are
mounted: each mounted filesystem can then have its own internal encoding,
and there will be different behaviors when using a driver for any MacOS
filesystem.

Linux can perfectly work with NTFS filesystems, except that most of the
time, short filenames will be completely ignored and not generated on the
fly.

This generation of short filenames in a legacy (unspecified) 8-bit codepage
is not a requirement of NTFS and it can be disabled also in Windows.

But FAT12/FAT16/FAT32 still require these legacy short names to be
generated even where only the LFN would be used and the short 8.3 name
could be left completely null in the main directory entry; legacy FAT
drivers will choke on these null entries if they are not tagged by a custom
attribute bit as "ignorable but not empty", or if the 8+3 characters do not
use specific unique patterns such as "\" followed by 7 pseudo-random
characters in the main part, plus 3 other pseudo-random characters in the
extension (these 10 characters may use any non-null value: they provide
nearly 80 bits, or more exactly 250^10 identifiers, if we exclude the 6
reserved characters "/", "\", ".", ":", NULL and SPACE; they could be
generated almost predictably simply by hashing the original unabbreviated
name with 79 bits from SHA-128, or faster with simple MD5 hashing, with
very rare remaining collisions to handle).

Some FAT repair tools will attempt to repair legacy short filenames that
are not unique or cannot be derived from the UTF-16 encoded LFN (this
happens when "repairing" a FAT volume initially created on another system
that used a different 8-bit OEM codepage); these "CheckDisk" tools should
have an option not to "repair" them, given that modern applications
normally do not need these filenames if an LFN is present (even Windows
Explorer will not display these short names, because they are hidden by
default whenever there's an LFN that overrides them).

We must add, however, that on FAT filesystems an LFN will not always be
stored if the Unicode name already has 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 17:30:01 +
Shawn Steele via Unicode  wrote:

> > Would you advocate replacing  
> 
> >   e0 80 80  
> 
> > with  
> 
> >   U+FFFD U+FFFD U+FFFD (1)  
> 
> > rather than  
> 
> >   U+FFFD   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d
> > say, and while we certainly don’t want to decode it as a NUL (that
> > was the source of previous security bugs, as I recall), I also
> > don’t see the logic in insisting that it must be decoded to *three*
> > code points when it clearly only represented one in the input.  
> 
> It is not at all clear what the intent of the encoder was - or even
> if it's not just a problem with the data stream.  E0 80 80 is not
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is
still in use, and seems to be the best way of storing NUL as character
content in a *C string*.  (Strictly speaking, one can't do it.)  It
could be lurking in old text or come from an old program that somehow
doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2
to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8.
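
For reference, a decoder that follows the current recommendation handles the
overlong NUL like this (sketch):

    overlong_nul = b"\xc0\x80"            # Java "modified UTF-8" encoding of U+0000
    overlong_nul.decode("utf-8", errors="replace")
    # -> '\ufffd\ufffd' today (C0 can never start a sequence, 80 is a stray
    #    trail byte); under the proposed change the pair would instead be one
    #    maximal subsequence and yield a single U+FFFD.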

Remember the conformance test for the Unicode Collation Algorithm has
contained lone surrogates in the past, and the UAX on Unicode Regular
Expressions used to require the ability to search for lone surrogates.

Richard.



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Would you advocate replacing

>   e0 80 80

> with

>   U+FFFD U+FFFD U+FFFD (1)

> rather than

>   U+FFFD   (2)

> It’s pretty clear what the intent of the encoder was there, I’d say, and 
> while we certainly don’t 
> want to decode it as a NUL (that was the source of previous security bugs, as 
> I recall), I also don’t
> see the logic in insisting that it must be decoded to *three* code points 
> when it clearly only 
> represented one in the input.

It is not at all clear what the intent of the encoder was - or even if it's not 
just a problem with the data stream.  E0 80 80 is not permitted, it's garbage.  
An encoder can't "intend" it.

Either
A) the "encoder" was attempting to be malicious, in which case the whole thing 
is suspect and garbage, and so the # of FFFD's doesn't matter, or

B) the "encoder" is completely broken, in which case all bets are off, again, 
specifying the # of FFFD's is irrelevant.

C) The data was corrupted by some other means.  Perhaps bad concatenations, 
lost blocks during read/transmission, etc.  If we lost two 512-byte blocks, then 
maybe we should have a thousand FFFDs (but how would we know?)

-Shawn



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 18:38, Alastair Houghton  
> wrote:
> 
> On 16 May 2017, at 17:23, Hans Åberg  wrote:
>> 
>> HFS implements case insensitivity in a layer above the filesystem raw 
>> functions. So it is perfectly possible to have files that differ by case 
>> only in the same directory by using low level function calls. The Tenon 
>> MachTen did that on Mac OS 9 already.
> 
> You keep insisting on this, but it’s not true; I’m a disk utility developer, 
> and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory 
> data (a single one for the entire disk, not one per directory either), and 
> that that tree is sorted by (CNID, filename) pairs.  And since it’s 
> case-preserving *and* case-insensitive, the comparisons it does to order its 
> B+-Tree nodes *cannot* be raw.  I should know - I’ve actually written the 
> code for it!
> 
> Even for legacy HFS, which didn’t store UTF-16, but stored a specified Mac 
> legacy encoding (the encoding used is in the volume header), it’s case 
> sensitive, so the encoding matters.
> 
> I don’t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know 
> how the filesystem works.

One could make files that differed by case in the same directory, and Mac OS 9 
did not bother. Legacy HFS tended to slow down with many files in the same 
directory, so that gave an impression of a tree structure. The BSD filesystem 
at the time, perhaps the one that Mac OS X once supported, did not store files 
in a tree, but flat with redundancy.  The other info I got on the Austin Group 
List a decade ago.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 17:23, Hans Åberg  wrote:
> 
> HFS implements case insensitivity in a layer above the filesystem raw 
> functions. So it is perfectly possible to have files that differ by case only 
> in the same directory by using low level function calls. The Tenon MachTen 
> did that on Mac OS 9 already.

You keep insisting on this, but it’s not true; I’m a disk utility developer, 
and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory 
data (a single one for the entire disk, not one per directory either), and that 
that tree is sorted by (CNID, filename) pairs.  And since it’s case-preserving 
*and* case-insensitive, the comparisons it does to order its B+-Tree nodes 
*cannot* be raw.  I should know - I’ve actually written the code for it!

Even for legacy HFS, which didn’t store UTF-16, but stored a specified Mac 
legacy encoding (the encoding used is in the volume header), it’s case 
sensitive, so the encoding matters.

I don’t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how 
the filesystem works.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 18:13, Alastair Houghton  
> wrote:
> 
> On 16 May 2017, at 17:07, Hans Åberg  wrote:
>> 
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
> UCS-2/UTF-16. ...
 
 The filesystem directory is using octet sequences and does not bother 
 passing over an encoding, I am told. Someone could remember one that 
 used UTF-16 directly, but I think it may not be current.
>>> 
>>> No, that’s not true.  All three of those systems store UTF-16 on the disk 
>>> (give or take).
>> 
>> I am not speaking about what they store, but how the filesystem identifies 
>> files.
> 
> Well, quite clearly none of those systems treat the UTF-16 strings as binary 
> either - they’re case insensitive, so how could they?  HFS+ even normalises 
> strings using a variant of a frozen version of the normalisation spec.

HFS implements case insensitivity in a layer above the filesystem raw 
functions. So it is perfectly possible to have files that differ by case only 
in the same directory by using low level function calls. The Tenon MachTen did 
that on Mac OS 9 already.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 17:07, Hans Åberg  wrote:
> 
 HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
 UCS-2/UTF-16. ...
>>> 
>>> The filesystem directory is using octet sequences and does not bother 
>>> passing over an encoding, I am told. Someone could remember one that 
>>> used UTF-16 directly, but I think it may not be current.
>> 
>> No, that’s not true.  All three of those systems store UTF-16 on the disk 
>> (give or take).
> 
> I am not speaking about what they store, but how the filesystem identifies 
> files.

Well, quite clearly none of those systems treat the UTF-16 strings as binary 
either - they’re case insensitive, so how could they?  HFS+ even normalises 
strings using a variant of a frozen version of the normalisation spec.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 17:52, Alastair Houghton  
> wrote:
> 
> On 16 May 2017, at 16:44, Hans Åberg  wrote:
>> 
>> On 16 May 2017, at 17:30, Alastair Houghton via Unicode 
>>  wrote:
>>> 
>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
>>> UCS-2/UTF-16. ...
>> 
>> The filesystem directory is using octet sequences and does not bother 
>> passing over an encoding, I am told. Someone could remember one that to used 
>> UTF-16 directly, but I think it may not be current.
> 
> No, that’s not true.  All three of those systems store UTF-16 on the disk 
> (give or take).

I am not speaking about what they store, but how the filesystem identifies 
files.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 16:44, Hans Åberg  wrote:
> 
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode  
> wrote:
>> 
>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
>> UCS-2/UTF-16. ...
> 
> The filesystem directory is using octet sequences and does not bother passing 
> over an encoding, I am told. Someone could remember one that used UTF-16 
> directly, but I think it may not be current.

No, that’s not true.  All three of those systems store UTF-16 on the disk (give 
or take).  On Windows, the “ANSI” APIs convert the filenames to or from the 
appropriate Windows code page, while the “Wide” API works in UTF-16, which is 
the native encoding for VFAT long filenames and NTFS filenames.  And, as I 
said, on Mac OS X and iOS, the kernel expects filenames to be encoded as UTF-8 
at the BSD API, regardless of what encoding you might be using in your Terminal 
(this is different to traditional UNIX behaviour, where how you interpret your 
filenames is entirely up to you - usually you’d use the same encoding you were 
using on your tty).

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 17:30, Alastair Houghton via Unicode  
> wrote:
> 
> On 16 May 2017, at 14:23, Hans Åberg via Unicode  wrote:
>> 
>> You don't. You have a filename, which is an octet sequence of unknown 
>> encoding, and want to deal with it. Therefore, valid Unicode transformations 
>> of the filename may result in it not being reachable.
>> 
>> It only matters that the correct octet sequence is handed back to the 
>> filesystem. All current filesystems, as far as experts could recall, use 
>> octet sequences at the lowest level; whatever encoding is used is built in a 
>> layer above. 
> 
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
> UCS-2/UTF-16. ...

The filesystem directory is using octet sequences and does not bother passing 
over an encoding, I am told. Someone could remember one that used UTF-16 
directly, but I think it may not be current.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 14:23, Hans Åberg via Unicode  wrote:
> 
> You don't. You have a filename, which is an octet sequence of unknown 
> encoding, and want to deal with it. Therefore, valid Unicode transformations 
> of the filename may result in it not being reachable.
> 
> It only matters that the correct octet sequence is handed back to the 
> filesystem. All current filesystems, as far as experts could recall, use octet 
> sequences at the lowest level; whatever encoding is used is built in a layer 
> above. 

HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
UCS-2/UTF-16.  FAT 8.3 names are also encoded, but the encoding isn’t specified 
(more specifically, MS-DOS and Windows assume an encoding based on your locale, 
which could cause all kinds of fun if you swapped disks with someone from a 
different country, and IIRC there are some shenanigans for Japan because of the 
use of 0xe5 as a deleted file marker).  There are some less widely used 
filesystems that require a particular encoding also (BeOS’ BFS used UTF-8, for 
instance).

Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use 
whose names can’t be converted to UTF-8, the Darwin kernel uses a percent 
encoding scheme(!)

It looks like Apple has changed its mind for APFS and is going with the “bag of 
bytes” approach that’s typical of other systems; at least, that’s what it 
appears to have done on iOS.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 15:23 GMT+02:00 Hans Åberg :

> All current filesystems, as far as experts could recall, use octet
> sequences at the lowest level; whatever encoding is used is built in a
> layer above
>

Not NTFS (on Windows), which uses sequences of 16-bit units. The same goes
for FAT32/exFAT with "Long File Names" (the legacy 8.3 short filenames use
legacy 8-bit codepages, but these are alternate filenames used only when
long filenames are not found, and they work mostly like aliasing physical
links on Unix filesystems, as if they were separate directory entries,
except that they are hidden by default when their matching LFN is already
shown).


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 15:00, Philippe Verdy  wrote:
> 
> 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :
> 
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode  
> > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
> 
> It would be useful, for use with filesystems, to have Unicode codepoint 
> markers that indicate how UTF-8, including non-valid sequences, is translated 
> into UTF-32 in a way that the original octet sequence can be restored.
> 
> Why just UTF-32 ?

Synonym for code point numbers. It would suffice to add markers for how it is 
translated. For example, codepoints meaning "overlong length ", 
"byte", or whatever is useful.

> How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid 
> UTF-8/UTF-16/UTF-32 ?

You don't. You have a filename, which is an octet sequence of unknown encoding, 
and want to deal with it. Therefore, valid Unicode transformations of the 
filename may result in it not being reachable.

It only matters that the correct octet sequence is handed back to the 
filesystem. All current filesystems, as far as experts could recall, use octet 
sequences at the lowest level; whatever encoding is used is built in a layer 
above. 
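
This is essentially what Python's PEP 383 ("surrogateescape") provides for
its filesystem APIs, assuming a POSIX system with a UTF-8 locale: names come
back as str, but the exact octets can always be handed back:

    import os

    name_bytes = b"report\xe9.txt"             # not valid UTF-8 (Latin-1 é)
    name_str = os.fsdecode(name_bytes)         # undecodable bytes become lone surrogates
    assert os.fsencode(name_str) == name_bytes # the exact octet sequence is restored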





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 14:44:44 +0200
Hans Åberg via Unicode  wrote:

> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> >  wrote:  
> ...
> > I think Unicode should not adopt the proposed change.  
> 
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original octet
> sequence can be restored.

Escape sequences for the inappropriate bytes are the natural technique.
Your problem is smoothly transitioning so that the escape character is
always escaped when it means itself. Strictly, it can't be done.

Of course, some sequences of escaped characters should be prohibited.
Checking could be fiddly.
 
Richard.



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 20:08:52 +0900
"Martin J. Dürst via Unicode"  wrote:

> I agree with others that ICU should not be considered to have a
> special status, it should be just one implementation among others.

> [The next point is a side issue, please don't spend too much time on 
> it.] I find it particularly strange that at a time when UTF-8 is
> firmly defined as up to 4 bytes, never including any bytes above
> 0xF4, the Unicode consortium would want to consider recommending that
>  be converted to a single U+FFFD. I note with
> agreement that Markus seems to have thoughts in the same direction,
> because the proposal (17168-utf-8-recommend.pdf) says "(I suppose
> that lead bytes above F4 could be somewhat debatable.)".

The undesirable sidetrack, I suppose, is worrying about how many planes
will be required for emoji.

However, it does make for the point that, while some practices may be
better than other, there isn't necessarily a best practice.

The English of the proposal is unclear - the text would benefit from
showing some maximal subsequences (poor terminology - some of us are
used to non-contiguous subsequences).  When he writes, "For UTF-8,
recommend evaluating maximal subsequences based on the original
structural definition of UTF-8, without ever restricting trail bytes to
less than 80..BF", I am pretty sure he means "For UTF-8,
recommend evaluating maximal subsequences based on the original
structural definition of UTF-8, with the only restriction on trailing
bytes beyond the number of them being that they must be in the range
80..BF".

Thus Philippe's example of "E0 E0 C3 89" would be converted, with an
error flagged, to the sequence of scalar values U+FFFD U+FFFD U+00C9.
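
For that particular input the current recommendation gives the same answer,
since neither E0 is followed by any trail byte; a decoder following the
current text (for example Python 3) already produces it:

    assert (b"\xe0\xe0\xc3\x89".decode("utf-8", errors="replace")
            == "\ufffd\ufffd\u00c9")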

This may make a UTF-8 system usable if it tries to use something like
non-characters as understood before CLDR was caught publishing them
as an essential part of text files.

Richard.



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode :

>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode 
> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.


Why just UTF-32 ? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to
valid UTF-8/UTF-16/UTF-32 ?

In all cases this would require extensions to the 3 standards (which MUST
be interoperable); you'll then choke on new validation rules for these 3
standards with these extensions, and on new ill-formed sequences that you
won't be able to convert interoperably. Given the most restrictive
conditions in UTF-16 (which is still the most widely used internal
representation), such extensions would be very complex to manage.

There's no solution: such extensions to any one of them are therefore
undesirable and can only be used privately (without interoperating with
the other 2 representations), so it's impossible to make sure the original
octet sequences can be restored.

Any deviation from UTF-8/16/32 will be confined to that same UTF. It cannot
be part of the 3 standard UTFs, but may be part of a distinct encoding, not
fully compatible with the 3 standards.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 15 May 2017, at 12:21, Henri Sivonen via Unicode  
> wrote:
...
> I think Unicode should not adopt the proposed change.

It would be useful, for use with filesystems, to have Unicode codepoint markers 
that indicate how UTF-8, including non-valid sequences, is translated into 
UTF-32 in a way that the original octet sequence can be restored.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode :

> > One additional note: the standard codifies this behaviour as a
> *recommendation*, not a requirement.
>
> This is an odd argument in favor of changing it. If the argument is
> that it's just a recommendation that you don't need to adhere to,
> surely then the people who don't like the current recommendation
> should choose not to adhere to it instead of advocating changing it.


I also agree. The internet is full of RFC specifications that are also
"best practices" and even in this case, changing them must be extensively
documented, including discussing new compatibility/interoperability
problems and new security risks.

The case of random access into substrings is significant because what was
once valid UTF-8 could become invalid if the best-practice recommendation
is not followed, and could then cause unexpected failures: uncaught
exceptions making software suddenly fail and become subject to possible
attacks due to this new failure (this is mostly a problem for
implementations that do not use "safe" U+FFFD replacements but throw
exceptions on ill-formed input: we should not change the cases where these
exceptions may occur by adding new cases caused by a change of
implementation based on a change of best practice).

The considerations about trying to reduce the number of U+FFFDs are not
relevant, and purely aesthetic, because some people would like to compact
the decoded result in memory. What is really important is to not silently
ignore these ill-formed sequences, and to properly track that there was
some data loss. The number of U+FFFDs inserted (only one, or as many as
there are invalid code units in the input before the first
resynchronization point) is not so important.

As well, whether implementations use an accumulator or just a single state
(where each state knows how many code units have been parsed without
emitting an output code point, so that these code points can be decoded by
relative indexed accesses) is not relevant; it is just a very minor
optimization case (in my opinion, using an accumulator that can live in a
CPU register is faster than using relative indexed accesses).

All modern CPUs have enough registers to store that accumulator plus the
input and output pointers, and a finite state number is not needed when the
state is tracked by the instruction position: you don't necessarily need to
loop once per code unit, but can easily write your decoder so that each loop
iteration processes a full code point or emits a single U+FFFD before
adjusting the input pointer. UTF-8 and UTF-16 are simple enough that unrolling
such loops to process full code points instead of single code units is easy to
implement:

That code will still remain very small (fitting fully in the instruction
cache), and it will be faster because it avoids several conditional branches
and saves one register (for the finite state number) that would otherwise have
to be slowly spilled to the stack: 2 pointer registers (or 2 access
function/method addresses) + 2 data registers + the program counter are
enough.
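
To make this concrete, here is a minimal sketch of such an unrolled loop (my
own illustration, not code from any implementation discussed in this thread;
the names and buffer conventions are invented). There is no separate state
variable: the "state" is simply the position in the code, and each iteration
either emits one complete code point or emits a single U+FFFD for the current
byte and resumes at the very next byte, as described above. The output buffer
is assumed to hold at least one code point per input byte.

#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

static int is_trail(uint8_t b) { return (b & 0xC0) == 0x80; }

/* Decodes in[0..len) into out, returning the number of code points written.
 * On any error, one U+FFFD is emitted for the current byte and scanning
 * resumes at the next byte, so every byte of an ill-formed sequence produces
 * its own replacement. */
size_t decode_utf8(const uint8_t *in, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        uint8_t b0 = in[i];
        if (b0 < 0x80) {                              /* ASCII */
            out[n++] = b0; i += 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF && i + 1 < len && is_trail(in[i + 1])) {
            out[n++] = ((uint32_t)(b0 & 0x1F) << 6) | (in[i + 1] & 0x3F);
            i += 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF && i + 2 < len
                   && is_trail(in[i + 1]) && is_trail(in[i + 2])) {
            uint32_t cp = ((uint32_t)(b0 & 0x0F) << 12)
                        | ((uint32_t)(in[i + 1] & 0x3F) << 6) | (in[i + 2] & 0x3F);
            if (cp >= 0x0800 && (cp < 0xD800 || cp > 0xDFFF)) {
                out[n++] = cp; i += 3;
            } else {                                  /* overlong or surrogate */
                out[n++] = REPLACEMENT; i += 1;
            }
        } else if (b0 >= 0xF0 && b0 <= 0xF4 && i + 3 < len
                   && is_trail(in[i + 1]) && is_trail(in[i + 2]) && is_trail(in[i + 3])) {
            uint32_t cp = ((uint32_t)(b0 & 0x07) << 18)
                        | ((uint32_t)(in[i + 1] & 0x3F) << 12)
                        | ((uint32_t)(in[i + 2] & 0x3F) << 6) | (in[i + 3] & 0x3F);
            if (cp >= 0x10000 && cp <= 0x10FFFF) {
                out[n++] = cp; i += 4;
            } else {                                  /* overlong or above U+10FFFF */
                out[n++] = REPLACEMENT; i += 1;
            }
        } else {                                      /* bad lead, stray trail, or bad/missing trail */
            out[n++] = REPLACEMENT; i += 1;
        }
    }
    return n;
}

Only the two positions and the code point currently being built are live at
any moment, which is roughly the register budget sketched above.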


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Martin J. Dürst via Unicode

Hello everybody,

[using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:

On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag  wrote:



Under what circumstance would it matter how many U+FFFDs you see?


Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
with the current spec should show why it's important to have a
different number of U+FFFDs than the spec's "best practice" calls for
now.


I have just checked (the programming language) Ruby. Some background:

As you might know, Ruby is (at least in theory) pretty 
encoding-independent, meaning you can run scripts in iso-8859-1, in 
Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, 
without any conversion.


However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 
internally, and is optimized to work well that way. Character encoding 
conversion also works with UTF-8 as the pivot encoding.


As far as I understand, Ruby does the same as all of the above software,
based (among other things) on the fact that we followed the recommendation in
the standard. Here are a few examples:

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=>"\uFFFD"

$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=>"\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=>"\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect'
#=>"\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect'
#=>"A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516=markup#l1507
(for those having a look at these tests, in Ruby's version of 
assert_equal, the expected value comes first (not sure whether this is 
called little-endian or big-endian :-), but this is a decision where the 
various test frameworks are virtually split 50/50 :-(. ))


Even if the above examples and the tests use conversion to UTF-16 (in 
particular the BE variant for better readability), what happens 
internally is that the input is analyzed byte-by-byte. In this case, it 
is easiest to just stop as soon as something is found that is clearly 
invalid (be this a single byte or something longer). This makes a 
data-driven implementation (such as the Ruby transcoder) or one based on 
a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) 
more compact.


In other words, because we never know whether the next byte is a valid 
one such as 0x41, it's easier to just handle one byte at a time so that 
we can avoid lookahead (which is always a good idea when parsing).
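
For concreteness, here is a rough sketch of that byte-at-a-time handling
(written in C rather than Ruby, and not Ruby's actual transcoder; the names
are invented): the lead byte determines how many continuation bytes are
expected and which range the first of them may come from, and as soon as a
byte cannot continue the sequence, one U+FFFD is emitted for the bytes
consumed so far and decoding resumes at that very byte, with no lookahead.

#include <stdint.h>
#include <stdio.h>

static void emit(uint32_t cp) { printf("U+%04X ", (unsigned)cp); }

/* Decode [p, end), emitting one U+FFFD per maximal subpart of any
 * ill-formed subsequence, as in the current recommendation. */
void decode(const uint8_t *p, const uint8_t *end)
{
    while (p < end) {
        uint8_t b = *p++;
        uint32_t cp;
        unsigned need;                       /* continuation bytes still expected */
        uint8_t lo = 0x80, hi = 0xBF;        /* allowed range for the next byte   */

        if (b <= 0x7F) { emit(b); continue; }
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) {
            need = 2; cp = b & 0x0F;
            if (b == 0xE0) lo = 0xA0;        /* exclude overlongs   */
            if (b == 0xED) hi = 0x9F;        /* exclude surrogates  */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3; cp = b & 0x07;
            if (b == 0xF0) lo = 0x90;        /* exclude overlongs   */
            if (b == 0xF4) hi = 0x8F;        /* stay <= U+10FFFF    */
        } else { emit(0xFFFD); continue; }   /* 80..BF, C0, C1, F5..FF */

        while (need > 0) {
            if (p == end || *p < lo || *p > hi) {
                emit(0xFFFD);                /* one U+FFFD for the bytes consumed so far */
                goto next;                   /* resume at the offending byte */
            }
            cp = (cp << 6) | (*p++ & 0x3F);
            lo = 0x80; hi = 0xBF;
            need--;
        }
        emit(cp);
    next:;
    }
}

Fed the byte strings from the Ruby transcripts above, this produces the same
replacement counts (1, 3, 4 and 6 U+FFFDs, and A U+FFFD U+FFFD A U+FFFD A for
the last one).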


I agree with Henri and others that there is no need at all to change the 
recommendation in the standard that has been stable for so long (close 
to 9 years).


Because the original was done on a PR 
(http://www.unicode.org/review/pr-121.html), I think this should at 
least also be handled as PR (if it's not dropped based on the discussion 
here).


I think changing the current definition of "maximal subsequence" is a 
bad idea, because it would mean that one wouldn't know what one was 
speaking about over the years. If necessary, new definitions should be 
introduced for other variants.


I agree with others that ICU should not be considered to have a special 
status, it should be just one implementation among others.


[The next point is a side issue, please don't spend too much time on 
it.] I find it particularly strange that at a time when UTF-8 is firmly 
defined as up to 4 bytes, never including any bytes above 0xF4, the 
Unicode consortium would want to consider recommending that <FD 81 82 83 
84 85> be converted to a single U+FFFD. I note with agreement that 
Markus seems to have thoughts in the same direction, because the 
proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes 
above F4 could be somewhat debatable.)".



Regards, Martin.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
>
> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable).  However, I’m not entirely certain about things
> like
>
>   e0 e0 c3 89
>
> which the proposal would appear to decode as
>
>   U+FFFD U+FFFD U+FFFD U+FFFD  (3)
>
> instead of a perhaps more reasonable
>
>   U+FFFD U+FFFD U+00C9 (4)
>
> (the key part is the “without ever restricting trail bytes to less than
> 80..BF”)
>

I also agree with that, because of access into strings from a random
position: if you access it at byte 0x89, you can assume it's a trailing byte
and you'll want to look backward; you will see 0xc3,0x89, which decodes
correctly as U+00C9 without any error detected.

So the wrong bytes are only the initial two occurrences of 0xe0, which are
individually converted to U+FFFD.

In summary: when you detect any ill-formed sequence, only replace the first
code unit by U+FFFD and restart scanning from the next code unit, without
skipping over multiple bytes.

This means that emitting multiple occurrences of U+FFFD is not only the best
practice, it also matches the intended design of UTF-8, which allows access
from random positions.
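
As a rough sketch of that backward scan (my own illustration under the
assumption of a flat byte buffer; the helper names are invented, and for
brevity only the generic 80..BF continuation range is checked, not the tighter
ranges required after E0/ED/F0/F4):

#include <stddef.h>
#include <stdint.h>

static int utf8_seq_len(uint8_t b)            /* 0 if b is not a lead byte */
{
    if (b <= 0x7F) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Given a random byte offset pos (pos < len), scan backward over at most
 * three continuation bytes for a lead byte, then verify that the sequence it
 * announces really covers pos. Returns the index of the lead byte of the
 * covering code point, or -1 if the byte at pos is part of an ill-formed
 * sequence and should be reported as U+FFFD. */
ptrdiff_t utf8_resync(const uint8_t *buf, size_t len, size_t pos)
{
    size_t start = pos, back = 0;
    while (back < 3 && start > 0 && (buf[start] & 0xC0) == 0x80) {
        start--;
        back++;
    }
    int n = utf8_seq_len(buf[start]);
    if (n == 0 || start + (size_t)n <= pos || start + (size_t)n > len)
        return -1;                            /* no valid lead covers pos */
    for (size_t k = 1; k < (size_t)n; k++)    /* all claimed trail bytes must be 80..BF */
        if ((buf[start + k] & 0xC0) != 0x80)
            return -1;
    return (ptrdiff_t)start;
}

Each trailing byte of an ill-formed sequence resolves to -1 on its own, which
is why random access naturally yields one U+FFFD per such byte, as described
above.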


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton
 wrote:
> On 16 May 2017, at 09:31, Henri Sivonen via Unicode  
> wrote:
>>
>> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>>  wrote:
>>> That would be true if the in-memory representation had any effect on what 
>>> we’re talking about, but it really doesn’t.
>>
>> If the internal representation is UTF-16 (or UTF-32), it is a likely
>> design that there is a variable into which the scalar value of the
>> current code point is accumulated during UTF-8 decoding.
>
> That’s quite a likely design with a UTF-8 internal representation too; it’s 
> just that you’d only decode during processing, as opposed to immediately at 
> input.

The time to generate the U+FFFDs is at input time, which is what's
at issue here. The later processing, which may then involve iterating
by code point and computing the scalar values, is a different
step that should be able to assume valid UTF-8 and not be concerned
with invalid UTF-8. (To what extent different programming languages
and frameworks allow confident maintenance of the invariant that after
input all in-RAM UTF-8 can be treated as valid varies.)

>> When the internal representation is UTF-8, only UTF-8 validation is
>> needed, and it's natural to have a fail-fast validator, which *doesn't
>> necessarily need such a scalar value accumulator at all*.
>
> Sure.  But a state machine can still contain appropriate error states without 
> needing an accumulator.

As I said upthread, it could, but it seems inappropriate to ask
implementations to take on that extra complexity on as weak grounds as
"ICU does it" or "feels right" when the current recommendation doesn't
call for those extra states and the current spec is consistent with a
number of prominent non-ICU implementations, including Web browsers.

>>> In what sense is this “interop”?
>>
>> In the sense that prominent independent implementations do the same
>> externally observable thing.
>
> The argument is, I think, that in this case the thing they are doing is the 
> *wrong* thing.

It seems weird to characterize following the currently-specced "best
practice" as "wrong" without showing a compelling fundamental flaw
(such as a genuine security problem) in the currently-specced "best
practice". With implementations of the currently-specced "best
practice" already shipped, I don't think aesthetic preferences should
be considered enough of a reason to proclaim behavior adhering to the
currently-specced "best practice" as "wrong".

>  That many of them do it would only be an argument if there was some reason 
> that it was desirable that they did it.  There doesn’t appear to be such a 
> reason, unless you can think of something that hasn’t been mentioned thus far?

I've already given a reason: UTF-8 validation code not needing to have
extra states catering to aesthetic considerations of U+FFFD
consolidation.

>  The only reason you’ve given, to date, is that they currently do that, so 
> that should be the recommended behaviour (which is little different from the 
> argument - which nobody deployed - that ICU currently does the other thing, 
> so *that* should be the recommended behaviour; the only difference is that 
> *you* care about browsers and don’t care about ICU, whereas you yourself 
> suggested that some of us might be advocating this decision because we care 
> about ICU and not about e.g. browsers).

Not just browsers. Also OpenJDK and Python 3. Do I really need to test
the standard libraries of more languages/systems to more strongly make
the case that the ICU behavior (according to the proposal PDF) is not
the norm and what the spec currently says is?

> I’ll add also that even among the implementations you cite, some of them 
> permit surrogates in their UTF-8 input (i.e. they’re actually processing 
> CESU-8, not UTF-8 anyway).  Python, for example, certainly accepts the 
> sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” 
> implementation that conformed literally to the recommendation, as you seem to 
> want, should instead replace it with *four* U+FFFDs (I think), no?

I see that behavior in Python 2. Earlier, I said that Python 3 agrees
with the current spec for my test case. The Python 2 behavior I see is
not just against "best practice" but obviously incompliant.

(For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.)

> One additional note: the standard codifies this behaviour as a 
> *recommendation*, not a requirement.

This is an odd argument in favor of changing it. If the argument is
that it's just a recommendation that you don't need to adhere to,
surely then the people who don't like the current recommendation
should choose not to adhere to it instead of advocating changing it.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 09:31, Henri Sivonen via Unicode  wrote:
> 
> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>  wrote:
>> That would be true if the in-memory representation had any effect on what 
>> we’re talking about, but it really doesn’t.
> 
> If the internal representation is UTF-16 (or UTF-32), it is a likely
> design that there is a variable into which the scalar value of the
> current code point is accumulated during UTF-8 decoding.

That’s quite a likely design with a UTF-8 internal representation too; it’s 
just that you’d only decode during processing, as opposed to immediately at 
input.

> When the internal representation is UTF-8, only UTF-8 validation is
> needed, and it's natural to have a fail-fast validator, which *doesn't
> necessarily need such a scalar value accumulator at all*.

Sure.  But a state machine can still contain appropriate error states without 
needing an accumulator.  That the ones you care about currently don’t is 
readily apparent, but there’s nothing stopping them from doing so.

I don’t see this as an argument about implementations, since it really makes 
very little difference to the implementation which approach is taken; in both 
internal representations, the question is whether you generate U+FFFD 
immediately on detection of the first incorrect *byte*, or whether you do so 
after reading a complete sequence.  UTF-8 sequences are bounded anyway, so it 
isn’t as if failing early gives you any significant performance benefit.

>> In what sense is this “interop”?
> 
> In the sense that prominent independent implementations do the same
> externally observable thing.

The argument is, I think, that in this case the thing they are doing is the 
*wrong* thing.  That many of them do it would only be an argument if there was 
some reason that it was desirable that they did it.  There doesn’t appear to be 
such a reason, unless you can think of something that hasn’t been mentioned 
thus far?  The only reason you’ve given, to date, is that they currently do 
that, so that should be the recommended behaviour (which is little different 
from the argument - which nobody deployed - that ICU currently does the other 
thing, so *that* should be the recommended behaviour; the only difference is 
that *you* care about browsers and don’t care about ICU, whereas you yourself 
suggested that some of us might be advocating this decision because we care 
about ICU and not about e.g. browsers).

I’ll add also that even among the implementations you cite, some of them permit 
surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not 
UTF-8 anyway).  Python, for example, certainly accepts the sequence [ed a0 bd 
ed b8 80] and decodes it as U+1F600; a true “fast fail” implementation that 
conformed literally to the recommendation, as you seem to want, should instead 
replace it with *four* U+FFFDs (I think), no?

One additional note: the standard codifies this behaviour as a 
*recommendation*, not a requirement.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode

> On 16 May 2017, at 10:29, David Starner  wrote:
> 
> On Tue, May 16, 2017 at 1:45 AM Alastair Houghton 
>  wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen to 
> decode to U+FFFD.  There might seem to be *two* names that both contain 
> U+FFFD in the same place.  How do you distinguish between them?
> 
>> If the database holds raw bytes, then the name is a byte string, not a 
>> Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule 
>> to make and enforce that a string in a database is a validly formatted 
>> string; I would hope that most SQL servers do in fact reject malformed UTF-8 
>> strings. On the other hand, I'd expect that an SQL server would accept 
>> U+FFFD in a Unicode string.

Databases typically separate the encoding in which strings are stored from the 
encoding in which an application connected to the database is operating.  A 
database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other 
character set, while presenting it to a client application as UTF-8 or UTF-16.  
Hence my comment - application software could very well see two names that are 
apparently identical and that include U+FFFDs in the same places, even though 
the database back-end actually has different strings.  As I said, this is a 
problem we already have.

> I don’t see a problem; the point is that where a structurally valid UTF-8 
> encoding has been used, albeit in an invalid manner (e.g. encoding a number 
> that is not a valid code point, or encoding a valid code point as an 
> over-long sequence), a single U+FFFD is appropriate.  That seems a perfectly 
> sensible rule to adopt.
>  
>> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that 
>> the only source of such UTF-8 data is willful attempts to break security, 
>> and in that case, how is this a win? Nonattack sources of broken data are 
>> much more likely to be the result of mixing UTF-8 with other character 
>> encodings or raw binary data.

I’d say there are three sources of UTF-8 data of that ilk:

(a) bugs,
(b) “Modified UTF-8” and “CESU-8” implementations,
(c) wilful attacks

(b) in particular is quite common, and the result of the presently recommended 
approach doesn’t make much sense there ([c0 80] will get replaced with *two* 
U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *four* U+FFFDs - 
surrogates aren’t supposed to be valid in UTF-8, right?)

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD.  There might seem to be *two* names that both contain
> U+FFFD in the same place.  How do you distinguish between them?
>

If the database holds raw bytes, then the name is a byte string, not a
Unicode string, and can't contain U+FFFD at all. It's a relatively easy
rule to make and enforce that a string in a database is a validly formatted
string; I would hope that most SQL servers do in fact reject malformed
UTF-8 strings. On the other hand, I'd expect that an SQL server would
accept U+FFFD in a Unicode string.


> I don’t see a problem; the point is that where a structurally valid UTF-8
> encoding has been used, albeit in an invalid manner (e.g. encoding a number
> that is not a valid code point, or encoding a valid code point as an
> over-long sequence), a single U+FFFD is appropriate.  That seems a
> perfectly sensible rule to adopt.
>

It seems like a perfectly arbitrary rule to adopt; I'd like to assume that
the only source of such UTF-8 data is willful attempts to break security,
and in that case, how is this a win? Nonattack sources of broken data are
much more likely to be the result of mixing UTF-8 with other character
encodings or raw binary data.

>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode

> On 16 May 2017, at 09:18, David Starner  wrote:
> 
> On Tue, May 16, 2017 at 12:42 AM Alastair Houghton 
>  wrote:
>> If you’re about to mutter something about security, consider this: security 
>> code *should* refuse to compare strings that contain U+FFFD (or at least 
>> should never treat them as equal, even to themselves), because it has no way 
>> to know what that code point represents.
>> 
> Which causes various other security problems; if an object (file, database 
> element, etc.) gets a name with a FFFD in it, it becomes impossible to 
> reference. That an IEEE 754 float may not equal itself is a perpetual source 
> of confusion for programmers.

That’s true anyway; imagine the database holds raw bytes, that just happen to 
decode to U+FFFD.  There might seem to be *two* names that both contain U+FFFD 
in the same place.  How do you distinguish between them?

Clearly if you are holding Unicode code points that you know are validly 
encoded somehow, you may want to be able to match U+FFFDs, but that’s a special 
case where you have extra knowledge.

> In this case, It's pretty clear, but I don't see it as a general rule.  Any 
> rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake 
> or random binary data.

I don’t see a problem; the point is that where a structurally valid UTF-8 
encoding has been used, albeit in an invalid manner (e.g. encoding a number 
that is not a valid code point, or encoding a valid code point as an over-long 
sequence), a single U+FFFD is appropriate.  That seems a perfectly sensible 
rule to adopt.

The proposal actually does cover things that aren’t structurally valid, like 
your e0 e0 e0 example, which it suggests should be a single U+FFFD because the 
initial e0 denotes a three byte sequence, and your 80 80 80 example, which it 
proposes should constitute three illegal subsequences (again, both reasonable). 
 However, I’m not entirely certain about things like

  e0 e0 c3 89

which the proposal would appear to decode as

  U+FFFD U+FFFD U+FFFD U+FFFD  (3)

instead of a perhaps more reasonable

  U+FFFD U+FFFD U+00C9 (4)

(the key part is the “without ever restricting trail bytes to less than 80..BF”)

and if Markus or others could explain why they chose (3) over (4) I’d be quite 
interested to hear the explanation.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag  wrote:
> but I think the way he raises this point is needlessly antagonistic.

I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
 wrote:
> That would be true if the in-memory representation had any effect on what 
> we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely
design that there is a variable into which the scalar value of the
current code point is accumulated during UTF-8 decoding. In such a
scenario, it can be argued as "natural" to first operate according to
the general structure of UTF-8 and then inspect what you got in the
accumulation variable (ruling out non-shortest forms, values above the
Unicode range and surrogate values after the fact).

When the internal representation is UTF-8, only UTF-8 validation is
needed, and it's natural to have a fail-fast validator, which *doesn't
necessarily need such a scalar value accumulator at all*. The
construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ when
used as a UTF-8 validator is the best illustration of a UTF-8
validator not necessarily looking like a "natural" UTF-8 to UTF-16
converter at all.
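
To sketch what that can look like without the DFA tables (this is my own
illustration, not Björn Höhrmann's code nor any shipping validator; the
function name is made up): the validator only checks byte ranges, never
accumulates a scalar value, and reports the offset at which it fails, which is
exactly the boundary at which the currently-recommended best practice would
place a U+FFFD before resynchronizing.

#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..len) is well-formed UTF-8; otherwise returns 0 and
 * stores in *err the offset at which validation failed (== len when the
 * input ends in the middle of a sequence). No scalar value is computed. */
int utf8_validate(const uint8_t *s, size_t len, size_t *err)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        unsigned need;
        uint8_t lo = 0x80, hi = 0xBF;       /* allowed range for the next byte */
        if (b <= 0x7F) { i++; continue; }
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; }
        else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;       /* reject overlongs   */
            if (b == 0xED) hi = 0x9F;       /* reject surrogates  */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;       /* reject overlongs   */
            if (b == 0xF4) hi = 0x8F;       /* reject > U+10FFFF  */
        } else { *err = i; return 0; }      /* C0, C1, F5..FF, stray trail byte */
        i++;
        for (; need > 0; need--) {
            if (i == len || s[i] < lo || s[i] > hi) { *err = i; return 0; }
            lo = 0x80; hi = 0xBF;
            i++;
        }
    }
    return 1;
}

A decoder built on top of this replaces the ill-formed bytes seen so far with
one U+FFFD and restarts validation at (or just after) the reported offset;
merging several such failures into a single U+FFFD is what would require the
extra error states discussed in this thread.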

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right”.
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same
externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
with the current spec should show why it's important to have a
different number of U+FFFDs than the spec's "best practice" calls for
now.

>  If you’re about to mutter something about security, consider this: security 
> code *should* refuse to compare strings that contain U+FFFD (or at least 
> should never treat them as equal, even to themselves), because it has no way 
> to know what that code point represents.

In practice, e.g. the Web Platform doesn't allow for stopping
operating on input that contains an U+FFFD, so the focus is mainly on
making sure that U+FFFDs are placed well enough to prevent bad stuff
under normal operations. At least typically, the number of U+FFFDs
doesn't matter for that purpose, but when browsers agree on the number
 of U+FFFDs, changing that number should have an overwhelmingly strong
rationale. A security reason could be a strong reason, but such a
security motivation for fewer U+FFFDs has not been shown, to my
knowledge.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD   (2)

I advocate (1), most simply because that's what Firefox, Edge and
Chrome do *in accordance with the currently-recommended best practice*
and, less simply, because it makes sense in the presence of a
fail-fast UTF-8 validator. I think the burden of proof to show an
overwhelmingly good reason to change should, at this point, be on
whoever proposes doing it differently than what the current
widely-implemented spec says.

> It’s pretty clear what the intent of the encoder was there, I’d say, and 
> while we certainly don’t want to decode it as a NUL (that was the source of 
> previous security bugs, as I recall), I also don’t see the logic in insisting 
> that it must be decoded to *three* code points when it clearly only 
> represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever
a fail-fast validator fails.

> This isn’t just a matter of “feels nicer”.  (1) is simply illogical 
> behaviour, and since behaviours (1) and (2) are both clearly out there today, 
> it makes sense to pick the more logical alternative as the official 
> recommendation.

Again, the current best practice makes perfect logical sense in the
context of a fail-fast UTF-8 validator. Moreover, it doesn't look like
both are "out there" equally when major browsers, OpenJDK and Python 3
agree. (I expect I could find more prominent implementations that
implement the currently-stated best practice, but I feel I shouldn't
have to.) From my experience from working on Web standards and
implementing them, I think it's 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <
alast...@alastairs-place.net> wrote:

> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves), because it has
> no way to know what that code point represents.
>

Which causes various other security problems; if an object (file, database
element, etc.) gets a name with a FFFD in it, it becomes impossible to
reference. That an IEEE 754 float may not equal itself is a perpetual
source of confusion for programmers.


> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD   (2)
>
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in
> insisting that it must be decoded to *three* code points when it clearly
> only represented one in the input.
>

In this case, It's pretty clear, but I don't see it as a general rule.  Any
rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or
mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not
going to insist that it get replaced with U+FFFD U+FFFD because it's clear
(to me) it was meant as two characters.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 10:01:03 +0300
Henri Sivonen via Unicode  wrote:

> Even so, I think even changing a recommendation of "best practice"
> needs way better rationale than "feels right" or "ICU already does it"
> when a) major browsers (which operate in the most prominent
> environment of broken and hostile UTF-8) agree with the
> currently-recommended best practice and b) the currently-recommended
> best practice makes more sense for implementations where "UTF-8
> decoding" is actually mere "UTF-8 validation".

There was originally an attempt to prescribe rather than to recommend
the interpretation of ill-formed 8-bit Unicode strings.  It may even
briefly have been an issued prescription, until common sense prevailed.
I do remember a sinking feeling when I thought I would have to change
my own handling of bogus UTF-8, only to be relieved later when it
became mere best practice.  However, it is not uncommon for coding
standards to prescribe 'best practice'.

Richard.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread J Decker via Unicode
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
>  wrote:
> > I’m not sure how the discussion of “which is better” relates to the
> > discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, because UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraints permit)
> that *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory
> representation (for the purposes of this thread) but trying to
> motivate why the Consortium should consider "UTF-8 internally" equally
> despite it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally"
> perspective, but one way clearly makes more sense from the "UTF-8
> internally" perspective, the "UTF-8 internally" perspective should be
> decisive in *such a case*. (I think the matter at hand is such a
> case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome)


Something I've learned through working with Node (the V8 JavaScript engine
from Chrome): V8 stores strings either as UTF-16 OR UTF-8 interchangeably and
is not one OR the other...

https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

and I wouldn't really assume UTF-16 is a 'majority'; Go is UTF-8, for
instance.



> shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 08:22, Asmus Freytag via Unicode  wrote:

> I therefore think that Henri has a point when he's concerned about tacit 
> assumptions favoring one memory representation over another, but I think the 
> way he raises this point is needlessly antagonistic.

That would be true if the in-memory representation had any effect on what we’re 
talking about, but it really doesn’t.

(The only time I can think of that the in-memory representation has a 
significant effect is where you’re talking about default binary ordering of 
string data, in which case, in the presence of non-BMP characters, UTF-8 and 
UCS-4 sort the same way, but because the surrogates are “in the wrong place”, 
UTF-16 doesn’t.  I think everyone is well aware of that, no?)

>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>> test with three major browsers that use UTF-16 internally and have
>> independent (of each other) implementations of UTF-8 decoding
>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>> Unicode standard away from that kind of interop needs *way* better
>> rationale than "feels right”.

In what sense is this “interop”?  Under what circumstance would it matter how 
many U+FFFDs you see?  If you’re about to mutter something about security, 
consider this: security code *should* refuse to compare strings that contain 
U+FFFD (or at least should never treat them as equal, even to themselves), 
because it has no way to know what that code point represents.

Would you advocate replacing

  e0 80 80

with

  U+FFFD U+FFFD U+FFFD (1)

rather than

  U+FFFD   (2)

It’s pretty clear what the intent of the encoder was there, I’d say, and while 
we certainly don’t want to decode it as a NUL (that was the source of previous 
security bugs, as I recall), I also don’t see the logic in insisting that it 
must be decoded to *three* code points when it clearly only represented one in 
the input.

This isn’t just a matter of “feels nicer”.  (1) is simply illogical behaviour, 
and since behaviours (1) and (2) are both clearly out there today, it makes 
sense to pick the more logical alternative as the official recommendation.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 15 May 2017, at 23:43, Richard Wordingham via Unicode  
wrote:
> 
> The problem with surrogates is inadequate testing.  They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered.  It's not always obvious that code is designed for UCS-2
> rather than UTF-16.

While I don’t think we should spend too long debating the relative merits of 
UTF-8 versus UTF-16, I’ll note that that argument applies equally to both 
combining characters and indeed the underlying UTF-8 encoding in the first 
place, and that mistakes in handling both are not exactly uncommon.  There are 
advantages to UTF-8 and advantages to UTF-16.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen  wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".

Testing with that file, Python 3 and OpenJDK 8 agree with the
currently-specced best-practice, too. I expect there to be other
well-known implementations that comply with the currently-specced best
practice, so the rationale to change the stated best practice would
have to be very strong (as in: security problem with currently-stated
best practice) for a change to be appropriate.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode

On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:

On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
 wrote:

I’m not sure how the discussion of “which is better” relates to the
discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, because UTF-8 as the internal memory
representation is *such a good design* (when legacy constraints permit)
that *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.
There are cases where it is prohibitive to transcode external data from 
UTF-8 to any other format, as a precondition to doing any work. In these 
situations processing has to be done in UTF-8, effectively making that 
the in-memory representation. I've encountered this issue on separate 
occasions, both for my own code as well as code I reviewed for clients.


I therefore think that Henri has a point when he's concerned about tacit 
assumptions favoring one memory representation over another, but I think 
the way he raises this point is needlessly antagonistic.

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.


This is a key point. It may not be directly relevant to any other 
modifications to the standard, but the larger point is to not make 
assumption about how people implement the standard (or any of the 
algorithms).

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

I would like to second this as well.

The level of documented review of existing implementation practices 
tends to be thin (at least thinner than should be required for changing 
long-established edge cases or recommendations, let alone core  
conformance requirements).


Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".
It would be good if the UTC could work out some minimal requirements for 
evaluating proposals for changes to properties and algorithms, much like 
the criteria for encoding new code points.

A./


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 15 May 2017, at 23:16, Shawn Steele via Unicode  wrote:
> 
> I’m not sure how the discussion of “which is better” relates to the 
> discussion of ill-formed UTF-8 at all.

It doesn’t, which is a point I made in my original reply to Henry.  The only 
reason I answered his anti-UTF-16 rant at all was to point out that some of us 
don’t think UTF-16 is a mistake, and in fact can see various benefits 
(*particularly* as an in-memory representation).

> And to the last, saying “you cannot process UTF-16 without handling 
> surrogates” seems to me to be the equivalent of saying “you cannot process 
> UTF-8 without handling lead & trail bytes”.  That’s how the respective 
> encodings work.

Quite.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson
 wrote:
> On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>>
>> The proposal is to make ICU's spec violation conforming. I think there
>> is both a technical and a political reason why the proposal is a bad
>> idea.
>
>
>
> Henri's claim that "The proposal is to make ICU's spec violation conforming"
> is a false statement, and hence all further commentary based on this false
> premise is irrelevant.
>
> I believe that ICU is actually currently conforming to TUS.

Do you mean that ICU's behavior differs from what the PDF claims (I
didn't test and took the assertion in the PDF about behavior at face
value) or do you mean that despite deviating from the
currently-recommended best practice the behavior is conforming,
because the relevant part of the spec is mere best practice and not a
requirement?

> TUS has certain requirements for UTF-8 handling, and it has certain other
> "Best Practices" as detailed in 3.9.  The proposal involves changing those
> recommendations.  It does not involve changing any requirements.

Even so, I think even changing a recommendation of "best practice"
needs way better rationale than "feels right" or "ICU already does it"
when a) major browsers (which operate in the most prominent
environment of broken and hostile UTF-8) agree with the
currently-recommended best practice and b) the currently-recommended
best practice makes more sense for implementations where "UTF-8
decoding" is actually mere "UTF-8 validation".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
 wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, because UTF-8 as the internal memory
representation is *such a good design* (when legacy constraints permit)
that *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory
representation (for the purposes of this thread) but trying to
motivate why the Consortium should consider "UTF-8 internally" equally
despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally"
perspective, but one way clearly makes more sense from the "UTF-8
internally" perspective, the "UTF-8 internally" perspective should be
decisive in *such a case*. (I think the matter at hand is such a
case.)

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/