Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Henri Sivonen via Unicode
Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest in the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivo...@hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
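
For illustration, here is a rough Rust sketch of the concatenation
property discussed above, using the encoding_rs crate mentioned in the
thread. It is only a sketch: whether U+FFFD actually appears at the seam
depends on which revision of the WHATWG Encoding Standard the crate
version implements.

use encoding_rs::ISO_2022_JP;

fn main() {
    // Encode two pieces of Japanese text separately; each encoder run
    // finishes by shifting back to the ASCII state.
    let (first, _, _) = ISO_2022_JP.encode("日本");
    let (second, _, _) = ISO_2022_JP.encode("語");

    // Concatenating the two conforming outputs places the trailing
    // "ESC ( B" of the first run immediately before the leading
    // "ESC $ B" of the second run, i.e. two escape sequences with no
    // content between them.
    let mut concatenated = first.into_owned();
    concatenated.extend_from_slice(&second);

    // Whether U+FFFD is emitted at the seam depends on whether the
    // decoder implements the "ISO-2022-JP output flag" rule.
    let (decoded, _, had_errors) = ISO_2022_JP.decode(&concatenated);
    println!("{:?} (errors: {})", decoded, had_errors);
}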



Re: Proposing mostly invisible characters

2019-09-12 Thread Henri Sivonen via Unicode
On Thu, Sep 12, 2019, 15:53 Christoph Päper via Unicode 
wrote:

> ISHY/SIHY is especially useful for encoding (German) noun compounds in
> wrapped titles, e.g. on product labeling, where hyphens are often
> suppressed for stylistic reasons, e.g. orthographically correct
> _Spargelsuppe_, _Spargel-Suppe_ (U+002D) or _Spargel‐Suppe_ (U+2010) may be
> rendered as _Spargel␤Suppe_ and could then be encoded as
> _SpargelSuppe_.
>

Why should this stylistic decision be encoded in the text content as
opposed to being a policy applied on the CSS (or conceptually equivalent)
layer?

>


Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2018-12-10 Thread Henri Sivonen via Unicode
We're about to remove the U+FFFD generation for the case where there
is no content between two ISO-2022-JP escape sequences from the WHATWG
Encoding Standard.

Is there anything wrong with my analysis that U+FFFD generation in
that case is not a useful security measure when unnecessary
transitions between the ASCII and Roman states do not generate U+FFFD?

On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>
> Context: https://github.com/whatwg/encoding/issues/115
>
> Unicode Security Considerations say:
> "3.6.2 Some Output For All Input
>
> Character encoding conversion must also not simply skip an illegal
> input byte sequence. Instead, it must stop with an error or substitute
> a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> or an escape sequence in the output. (See also Section 3.5 Deletion of
> Code Points.) It is important to do this not only for byte sequences
> that encode characters, but also for unrecognized or "empty"
> state-change sequences. For example:
> [...]
> ISO-2022 shift sequences without text characters before the next shift
> sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> require at least one character in a text segment between shift
> sequences. Security software written to the formal specification may
> not detect malicious text  (for example, "delete" with a
> shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>
> The WHATWG Encoding Standard bakes this requirement by the means of
> "ISO-2022-JP output flag"
> (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
> ISO-2022-JP decoder algorithm
> (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>
> encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> WHATWG spec.
>
> After Gecko switched to encoding_rs from an implementation that didn't
> implement this U+FFFD generation behavior (uconv), a bug has been
> logged in the context of decoding Japanese email in Thunderbird:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>
> Ken Lunde also recalls seeing such email:
> https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>
> The root problem seems to be that the requirement gives ISO-2022-JP
> the unusual and surprising property that concatenating two ISO-2022-JP
> outputs from a conforming encoder can result in a byte sequence that
> is non-conforming as input to an ISO-2022-JP decoder.
>
> Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> sequence is immediately followed by another ISO-2022-JP escape
> sequence. Chrome and Safari do, but their implementations of
> ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> decoder implementations generally are informed by the Encoding
> Standard (though the ISO-2022-JP decoder specifically might not be
> yet), and I suspect that Safari's implementation (ICU) is either
> informed by Unicode Security Considerations or vice versa.
>
> The example given as rationale in Unicode Security Considerations,
> obfuscating the ASCII string "delete", could be accomplished by
> alternating between the ASCII and Roman states so that every other
> character is in the ASCII state and the rest in the Roman state.
>
> Is the requirement to generate U+FFFD when there is no content between
> ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> transitions or useless transitions between ASCII and Roman are not
> also required to generate U+FFFD? Would it even be feasible (in terms
> of interop with legacy encoders) to make useless transitions between
> ASCII and Roman generate U+FFFD?
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/



-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Unicode String Models

2018-11-22 Thread Henri Sivonen via Unicode
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️  wrote:

>
>   * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or contract, there's a performance impact that's not
>> apparent from the nature of the operation if the programmer's
>> intuition works on the assumption that the programmer is dealing with
>> UTF-32.
>>
>
> The focus was on immutable string models, but I didn't make that clear.
> Added some text.
>

Thanks.


>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>> optionally, HotSpot
>> (
>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
>> ).
>> That is, text has UTF-16 semantics, but if the high half of every code
>> unit in a string is zero, only the lower half is stored. This has
>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>>
>
> Thanks, will add.
>

V8 source code shows it has a OneByteString storage option:
https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium=0=494
. From hearsay, I'm convinced that it means Latin1, but I've failed to find
a clear quotable statement from a V8 developer to that effect.
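
As a concrete illustration of the storage decision such engines make
(not code from V8 or SpiderMonkey; just a sketch of the idea):

/// Returns true if every UTF-16 code unit is at most 0xFF, i.e. the
/// string is pure Latin1 and the high half of each code unit can be
/// omitted in storage.
fn fits_in_one_byte_storage(utf16: &[u16]) -> bool {
    utf16.iter().all(|&unit| unit <= 0xFF)
}

/// Stores a UTF-16 string with one byte per code unit when possible,
/// falling back to two bytes per code unit otherwise.
fn compact(utf16: &[u16]) -> Result<Vec<u8>, Vec<u16>> {
    if fits_in_one_byte_storage(utf16) {
        Ok(utf16.iter().map(|&unit| unit as u8).collect())
    } else {
        Err(utf16.to_vec())
    }
}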


>   3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>> have a different type in the type system than byte buffers. To go from
>> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
>> has been tagged as valid UTF-8, the validity is trusted completely so
>> that iteration by code point does not have "else" branches for
>> malformed sequences. If data that the type system indicates to be
>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>> language has a default "safe" side and an opt-in "unsafe" side. The
>> unsafe side is for performing low-level operations in a way where the
>> responsibility of upholding invariants is moved from the compiler to
>> the programmer. It's impossible to violate the UTF-8 validity
>> invariant using the safe part of the language.
>>
>
> Added a quote based on this; please check if it is ok.
>

Looks accurate. Thanks.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-11-22 Thread Henri Sivonen via Unicode
On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ☕️  wrote:
>
> > That is, why is conforming to UAX #31 worth the risk of prohibiting the use 
> > of characters that some users might want to use?
>
> One could parse for certain sequences, putting characters into a number of 
> broad categories. Very approximately:
>
> junk ~= [[:cn:][:cs:][:co:]]+
> whitespace ~= [[:z:][:c:]-junk]+
> syntax ~= [[:s:][:p:]] // broadly speaking, including both the language 
> syntax and user-named operators
> identifiers ~= [all-else]+
>
> UAX #31 specifies several different kinds of identifiers, and takes roughly 
> that approach for 
> http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the 
> focus there is on immutability.
>
> So an implementation could choose to follow that course, rather than the more 
> narrowly defined identifiers in 
> http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, 
> one can conform to the Default Identifiers but declare a profile that expands 
> the allowable characters. One could take a Swiftian approach, for example...

Thank you and sorry about my slow reply. Why is excluding junk important?

> On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode  wrote:
>>
>> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen  wrote:
>> > Considering that ruling out too much can be a problem later, but just
>> > treating anything above ASCII as opaque hasn't caused trouble (that I
>> > know of) for HTML other than compatibility issues with XML's stricter
>> > stance, why should a programming language, if it opts to support
>> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
>> > complexity of UAX #31 instead of allowing everything above ASCII in
>> > identifiers? In other words, what problem does making a programming
>> > language conform to UAX #31 solve?
>>
>> After refreshing my memory of XML history, I realize that mentioning
>> XML does not helpfully illustrate my question despite the mention of
>> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
>> ignore the XML part.
>>
>> Trying to rephrase my question more clearly:
>>
>> Let's assume that we are designing a computer-parseable syntax where
>> tokens consisting of user-chosen characters can't occur next to each
>> other and, instead, always have some syntax-reserved characters
>> between them. That is, I'm talking about syntaxes that look like this
>> (could be e.g. Java):
>>
>> ab.cd();
>>
>> Here, ab and cd are tokens with user-chosen characters whereas space
>> (the indent),  period, parenthesis and the semicolon are
>> syntax-reserved. We know that ab and cd are distinct tokens, because
>> there is a period between them, and we know the opening parenthesis
>> ends the cd token.
>>
>> To illustrate what I'm explicitly _not_ talking about, I'm not talking
>> about a syntax like this:
>>
>> αβ⊗γδ
>>
>> Here αβ and γδ are user-named variable names and ⊗ is a user-named
>> operator and the distinction between different kinds of user-named
>> tokens has to be known somehow in order to be able to tell that there
>> are three distinct tokens: αβ, ⊗, and γδ.
>>
>> My question is:
>>
>> When designing a syntax where tokens with the user-chosen characters
>> can't occur next to each other without some syntax-reserved characters
>> between them, what advantages are there from limiting the user-chosen
>> characters according to UAX #31 as opposed to treating any character
>> that is not a syntax-reserved character as a character that can occur
>> in user-named tokens?
>>
>> I understand that taking the latter approach allows users to mint
>> tokens that on some aesthetic measure don't make sense (e.g. minting
>> tokens that consist of glyphless code points), but why is it important
>> to prescribe that this is prohibited as opposed to just letting users
>> choose not to mint tokens that are inconvenient for them to work with
>> given the behavior that their plain text editor gives to various
>> characters? That is, why is conforming to UAX #31 worth the risk of
>> prohibiting the use of characters that some users might want to use?
>> The introduction of XID after ID and the introduction of Extended
>> Hashtag Identifiers after XID is indicative of over-restriction having
>> been a problem.
>>
>> Limiting user-minted tokens to UAX #31 does not appear to be necessary
>> for security purposes considering that HTML and CSS exist in a
>> particularly adversarial environment and get away with taking the
>> approach that any character that isn't a syntax-reserved character is
>> collected as part of a user-minted identifier. (Informally, both treat
>> non-ASCII characters the same as an ASCII underscore. HTML even treats
>> non-whitespace, non-U+0000 ASCII controls that way.)
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>
>


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Unicode String Models

2018-09-12 Thread Henri Sivonen via Unicode
On Wed, Sep 12, 2018 at 11:37 AM Hans Åberg via Unicode
 wrote:
> The idea is to extend Unicode itself, so that those bytes can be represented 
> by legal codepoints.

Extending Unicode itself would likely create more problems than it
would solve. Extending the value space of Unicode scalar values would
be extremely disruptive for systems whose design is deeply committed
to the current definitions of UTF-16 and UTF-8 staying unchanged.
Assigning a scalar value within the current Unicode scalar value space
to currently malformed bytes would have the problem of those scalar
values losing information whether they came from malformed bytes or
the well-formed encoding of those scalar values.

It seems better to let applications that have use cases that involve
representing non-Unicode values use a special-purpose extension on
their own.
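
Purely as a sketch of what such a special-purpose, application-level
extension could look like (hypothetical; not an existing library):

// Hypothetical application-level representation: raw bytes from
// malformed sequences stay distinct from Unicode scalar values in the
// type system instead of being folded into Unicode itself.
enum TextUnit {
    Scalar(char), // decoded from a well-formed sequence
    RawByte(u8),  // a byte that was part of a malformed sequence
}

type ExtendedText = Vec<TextUnit>;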

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii  wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode 
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

I think extensions of UTF-8 that expand the value space beyond Unicode
scalar values and the problems these extensions are designed to solve
is a worthwhile topic to cover, but I think it's not the same topic as
in the document but a slightly adjacent topic.

On that topic, these two are relevant:
https://simonsapin.github.io/wtf-8/
https://github.com/kennytm/omgwtf8

The former is used in the Rust standard library in order to provide a
Unix-like view to Windows file paths in a way that can represent all
Windows file paths. File paths on Unix-like systems are sequences of
bytes whose presentable-to-humans interpretation (these days) is
UTF-8, but there's no guarantee of UTF-8 validity. File paths on
Windows are sequences of unsigned 16-bit numbers whose
presentable-to-humans interpretation is UTF-16, but there's no
guarantee of UTF-16 validity. WTF-8 can represent all Windows file
paths as sequences of bytes such that the paths that are valid UTF-16
as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit
representation. This allows application-visible file paths in the Rust
standard library to be sequences of bytes both on Windows and
non-Windows platforms and to be presentable to humans by decoding as
UTF-8 in both cases.
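
A minimal sketch of what this looks like at the Rust API surface
(standard library only; the WTF-8 encoding itself stays an internal
detail and is not exposed directly):

use std::path::Path;

fn show(path: &Path) {
    // The underlying bytes (WTF-8 on Windows, raw bytes on Unix-like
    // systems) are not guaranteed to be valid UTF-8/UTF-16, so
    // presenting the path to humans goes through a lossy conversion
    // that substitutes U+FFFD where needed.
    println!("{}", path.to_string_lossy());
}

fn main() {
    show(Path::new("/tmp/example"));
}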

To my knowledge, the latter isn't in use yet. The implementation is
tracked in https://github.com/rust-lang/rust/issues/49802

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
 wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models 
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

* The Grapheme Cluster Model seems to have a couple of disadvantages
that are not mentioned:
  1) The subunit of string is also a string (a short string conforming
to particular constraints). There's a need for *another* more atomic
mechanism for examining the internals of the grapheme cluster string.
  2) The way an arbitrary string is divided into units when iterating
over it changes when the program is executed on a newer version of the
language runtime that is aware of newly-assigned codepoints from a
newer version of Unicode.

 * The Python 3.3 model mentions the disadvantages of memory usage
cliffs but doesn't mention the associated performance cliffs. It would
be good to also mention that when a string manipulation causes the
storage to expand or contract, there's a performance impact that's not
apparent from the nature of the operation if the programmer's
intuition works on the assumption that the programmer is dealing with
UTF-32.

 * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
text node storage in Gecko, (I believe but am not 100% sure) V8 and,
optionally, HotSpot
(https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
That is, text has UTF-16 semantics, but if the high half of every code
unit in a string is zero, only the lower half is stored. This has
properties analogous to the Python 3.3 model, except non-BMP doesn't
expand to UTF-32 but uses UTF-16 surrogate pairs.

 * I think the fact that systems that chose UTF-16 or UTF-32 have
implemented models that try to save storage by omitting leading zeros
and gaining complexity and performance cliffs as a result is a strong
indication that UTF-8 should be recommended for newly-designed systems
that don't suffer from a forceful legacy need to expose UTF-16 or
UTF-32 semantics.

 * I suggest splitting the "UTF-8 model" into three substantially
different models:

 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
UTF-8-related operations are performed when ingesting byte-oriented
data. Byte buffers and text buffers are type-wise ambiguous. Only
iterating over byte data by code point gives the data the UTF-8
interpretation. Unless the data is cleaned up as a side effect of such
iteration, malformed sequences in input survive into output.

 2) UTF-8 without full trust in ability to retain validity (the model
of the UTF-8-using C++ parts of Gecko; I believe this to be the most
common UTF-8 model for C and C++, but I don't have evidence to back
this up): When data is ingested with text semantics, it is converted
to UTF-8. For data that's supposed to already be in UTF-8, this means
replacing malformed sequences with the REPLACEMENT CHARACTER, so the
data is valid UTF-8 right after input. However, iteration by code
point doesn't trust ability of other code to retain UTF-8 validity
perfectly and has "else" branches in order not to blow up if invalid
UTF-8 creeps into the system.

 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
have a different type in the type system than byte buffers. To go from
a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
has been tagged as valid UTF-8, the validity is trusted completely so
that iteration by code point does not have "else" branches for
malformed sequences. If data that the type system indicates to be
valid UTF-8 wasn't actually valid, it would be nasal demon time. The
language has a default "safe" side and an opt-in "unsafe" side. The
unsafe side is for performing low-level operations in a way where the
responsibility of upholding invariants is moved from the compiler to
the programmer. It's impossible to violate the UTF-8 validity
invariant using the safe part of the language.
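
A brief sketch of that safe/unsafe split in Rust terms:

fn handle(bytes: Vec<u8>) {
    // Safe side: going from bytes to a string checks UTF-8 validity;
    // the Err arm is the only place where malformedness can surface.
    match String::from_utf8(bytes) {
        Ok(s) => {
            // From here on, iteration by code point can trust validity
            // completely; there is no "else" branch for bad sequences.
            for _scalar in s.chars() {}
        }
        Err(e) => {
            // The caller decides what to do: reject, or substitute U+FFFD.
            let _lossy = String::from_utf8_lossy(e.as_bytes());
        }
    }

    // Unsafe side: the check can be skipped, but then upholding the
    // validity invariant is the programmer's responsibility, not the
    // compiler's.
    let _trusted = unsafe { std::str::from_utf8_unchecked(b"already checked") };
}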

 * After working with different string models, I'd recommend the Rust
model for newly-designed programming languages. (Not because I work
for Mozilla but because I believe Rust's way of dealing with Unicode
is the best I've seen.) Rust's standard library provides Unicode
version-independent iterations over strings: by code unit and by code
point. Iteration by extended grapheme cluster is provided by a library
that's easy to include due to the nature of Rust package management
(https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
buffer as a read-only byte buffer has zero run-time cost and allows
for maximally fast guaranteed-valid-UTF-8 output.
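
For concreteness, a small sketch of those operations (the grapheme part
assumes the unicode_segmentation crate linked above):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "e\u{0301}!"; // "e" + COMBINING ACUTE ACCENT + "!"

    // Iteration by code unit (UTF-8 bytes) and by code point (chars)
    // comes from the standard library and is Unicode-version-independent.
    let code_units = s.len();
    let code_points = s.chars().count();

    // Iteration by extended grapheme cluster comes from a library.
    // Note that each yielded item is itself a &str.
    let graphemes = s.graphemes(true).count();

    // Viewing the guaranteed-valid UTF-8 buffer as read-only bytes has
    // zero run-time cost, which allows maximally fast UTF-8 output.
    let _bytes: &[u8] = s.as_bytes();

    println!("{} bytes, {} scalars, {} graphemes", code_units, code_points, graphemes);
}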

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Henri Sivonen via Unicode
On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen  wrote:
> Considering that ruling out too much can be a problem later, but just
> treating anything above ASCII as opaque hasn't caused trouble (that I
> know of) for HTML other than compatibility issues with XML's stricter
> stance, why should a programming language, if it opts to support
> non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> complexity of UAX #31 instead of allowing everything above ASCII in
> identifiers? In other words, what problem does making a programming
> language conform to UAX #31 solve?

After refreshing my memory of XML history, I realize that mentioning
XML does not helpfully illustrate my question despite the mention of
XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
ignore the XML part.

Trying to rephrase my question more clearly:

Let's assume that we are designing a computer-parseable syntax where
tokens consisting of user-chosen characters can't occur next to each
other and, instead, always have some syntax-reserved characters
between them. That is, I'm talking about syntaxes that look like this
(could be e.g. Java):

ab.cd();

Here, ab and cd are tokens with user-chosen characters whereas space
(the indent),  period, parenthesis and the semicolon are
syntax-reserved. We know that ab and cd are distinct tokens, because
there is a period between them, and we know the opening parenthesis
ends the cd token.

To illustrate what I'm explicitly _not_ talking about, I'm not talking
about a syntax like this:

αβ⊗γδ

Here αβ and γδ are user-named variable names and ⊗ is a user-named
operator and the distinction between different kinds of user-named
tokens has to be known somehow in order to be able to tell that there
are three distinct tokens: αβ, ⊗, and γδ.

My question is:

When designing a syntax where tokens with the user-chosen characters
can't occur next to each other without some syntax-reserved characters
between them, what advantages are there from limiting the user-chosen
characters according to UAX #31 as opposed to treating any character
that is not a syntax-reserved character as a character that can occur
in user-named tokens?

I understand that taking the latter approach allows users to mint
tokens that on some aesthetic measure don't make sense (e.g. minting
tokens that consist of glyphless code points), but why is it important
to prescribe that this is prohibited as opposed to just letting users
choose not to mint tokens that are inconvenient for them to work with
given the behavior that their plain text editor gives to various
characters? That is, why is conforming to UAX #31 worth the risk of
prohibiting the use of characters that some users might want to use?
The introduction of XID after ID and the introduction of Extended
Hashtag Identifiers after XID is indicative of over-restriction having
been a problem.

Limiting user-minted tokens to UAX #31 does not appear to be necessary
for security purposes considering that HTML and CSS exist in a
particularly adversarial environment and get away with taking the
approach that any character that isn't a syntax-reserved character is
collected as part of a user-minted identifier. (Informally, both treat
non-ASCII characters the same as an ASCII underscore. HTML even treats
non-whitespace, non-U+0000 ASCII controls that way.)
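
To make the contrast concrete, here is a toy Rust sketch (not from any
real implementation) of the HTML/CSS-style approach, where anything that
is not syntax-reserved or ASCII whitespace is collected into a
user-minted token:

// Illustrative toy lexer: syntax-reserved characters and ASCII
// whitespace delimit tokens; any other character, including any
// non-ASCII character, is collected into a user-named token without
// consulting UAX #31 properties.
fn lex(input: &str) -> Vec<String> {
    const RESERVED: &[char] = &['.', '(', ')', ';'];
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in input.chars() {
        if c.is_ascii_whitespace() || RESERVED.contains(&c) {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

// lex("ab.cd();") yields ["ab", "cd"]; lex("αβ.γδ();") yields ["αβ", "γδ"].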

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Henri Sivonen via Unicode
On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode
 wrote:
> The Rust community is considering adding non-ascii identifiers, which follow
> UAX #31 (XID_Start XID_Continue*, with tweaks).

UAX #31 is rather light on documenting its rationale.

I realize that XML is a different case from Rust considering how the
Rust compiler is something a programmer runs locally whereas control
XML documents and XML processors, especially over time, is
significantly less coupled.

Still, the experience from XML and HTML suggests that, if non-ASCII is
to be allowed in identifiers at all, restricting the value space of
identifiers a priori easily ends up restricting too. HTML went with
the approach of collecting everything up to the next ASCII code point
that's a delimiter in HTML (and a later check for names that are
eligible for Custom Element treatment that mainly achieves
compatibility with XML but no such check for what the parser can
actually put in the document tree) while keeping the actual vocabulary
to ASCII (except for Custom Elements whose seemingly arbitrary
restrictions are inherited from XML).

XML 1.0 codified for element and attribute names what then was the
understanding of the topic that UAX #31 now covers and made other
cases a hard failure. Later, it turned out that XML originally ruled
out too much and the whole mess that was XML 1.1 and XML 1.0 5th ed.
resulted from trying to relax the rules.

Considering that ruling out too much can be a problem later, but just
treating anything above ASCII as opaque hasn't caused trouble (that I
know of) for HTML other than compatibility issues with XML's stricter
stance, why should a programming language, if it opts to support
non-ASCII identifiers in an otherwise ASCII core syntax, implement the
complexity of UAX #31 instead of allowing everything above ASCII in
identifiers? In other words, what problem does making a programming
language conform to UAX #31 solve?

Allowing anything above ASCII will lead to some cases that obviously
don't make sense, such as declaring a function whose name is a
paragraph separator, but why is it important to prohibit that kind of
thing when prohibiting things risks prohibiting too much, as happened
with XML, and people just don't mint identifiers that aren't practical
to them? Is there some important badness prevention concern that
applies to programming languages more than it applies to HTML? The key
thing here in terms of considering if badness is _prevented_ isn't
what's valid HTML but what the parser can actually put in the DOM, and
the HTML parser can actually put any non-ASCII code point in the DOM
as an element or attribute name (after the initial ASCII code point).

(The above question is orthogonal to normalization. I do see the value
of normalizing identifiers to NFC or requiring them to be in NFC to
begin with. I'm inclined to consider NFKC as a bug in the Rust
proposal.)
-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Is the Editor's Draft public?

2018-04-20 Thread Henri Sivonen via Unicode
On Fri, Apr 20, 2018 at 12:16 PM, Martin J. Dürst
 wrote:
> On 2018/04/20 18:12, Martin J. Dürst wrote:
>
>> There was an announcement for a public review period just recently. The
>> review period is up to the 23rd of April. I'm not sure whether the
>> announcement is up somewhere on the Web, but I'll forward it to you
>> directly.
>
> Sorry, found the Web address of the announcement at the very bottom of the
> mail: http://blog.unicode.org/2018/04/last-call-on-unicode-110-review.html

Thank you. I checked this review announcement (I should have said so
in my email; sorry), but it leads me to
https://unicode.org/versions/Unicode11.0.0/ which says the chapters
will be "Available June 2018". But even if the 11.0 chapters were
available, I'd expect there to exist an Editor's Draft that's now in a
post-11.0 but pre-12.0 state.

I guess I should just send my comments and take the risk of my
concerns already having been addressed.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Is the Editor's Draft public?

2018-04-20 Thread Henri Sivonen via Unicode
Is the Editor's Draft of the Unicode Standard visible publicly?

Use case: Checking if things that I might send feedback about have
already been addressed since the publication of Unicode 10.0.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


PDF restrictions on the Unicode Standard 10.0

2018-01-13 Thread Henri Sivonen via Unicode
I was reading 
https://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
on a Sony Digital Paper device and tried to scribble some notes and
make highlights but I couldn't. I still couldn't after ensuring that
the pen was charged and could write on other PDFs.

Since Evince told me just "Security: No", since the Digital Paper's UI
for designating non-editability is easy to miss and since there's no
password required to open the file, it took me non-trivial time to
figure out what was going on.

Upon examining the PDF in Acrobat Reader, it turned out that even
though the PDF can be viewed, printed and copied from without
artificial restrictions, there are various restriction bits set for
modifying the file. (Screenshot:
https://hsivonen.fi/screen/unicode-pdf-restrictions.png )

It doesn't make sense to me that the Consortium restricts me from
adding highlights or handwriting if I open the Standard on an e-Ink
device even though I can do those things if I print the PDF.

I'd like to request that going forward the Consortium refrain from
using restriction bits or any "security" on the PDFs it publishes.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Inadvertent copies of test data in L2/17-197 ?

2017-08-07 Thread Henri Sivonen via Unicode
On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst  wrote:
> I just had a look at http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf
> to use the test data in there for Ruby.
> I was under the impression from previous looks at it that it contained a lot
> of test data.

It contains the test outputs with identical results (output exhibiting
the spec-following behavior and output exhibiting the one REPLACEMENT
CHARACTER per bogus byte behavior) shown only once. Since the input
doesn't make sense as a PDF, it only mentions where to find the input
(https://hsivonen.fi/broken-utf-8/test.html).

> However, when I looked at the test data more carefully (I had
> read the text before the test data carefully at least two times before, but
> not looked at the test data in that much detail), I discovered that there
> might be up to 7 copies of the same data. The first one starts on page 9,
> and then there's a new one about every 4 or 5 pages.
>
> Can you check/confirm? Any idea what might have caused this?

The test outputs are not identical. They should be the content of the
following files with a bit of introductory text before each:
https://hsivonen.fi/broken-utf-8/spec.html
https://hsivonen.fi/broken-utf-8/one-per-byte.html
https://hsivonen.fi/broken-utf-8/win32.html
https://hsivonen.fi/broken-utf-8/java.html
https://hsivonen.fi/broken-utf-8/python2.html with non-conforming
output replaced with italic text saying what the bytes were
https://hsivonen.fi/broken-utf-8/perl5.html
https://hsivonen.fi/broken-utf-8/icu.html

I inspected the PDF multiple times just now, and, as far as I can
tell, the content indeed matches what I described above (no
duplicates).

For reference, I tested the Ruby standard library with the following program:

data = IO.read("test.html", encoding: "UTF-8")
encoded = data.encode("UTF-16LE", :invalid=>:replace).encode("UTF-8")
IO.write("ruby.html", encoded)

...where test.html was the file available at
https://hsivonen.fi/broken-utf-8/test.html

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-04 Thread Henri Sivonen via Unicode
On Fri, Aug 4, 2017 at 3:34 AM, Mark Davis ☕️ via Unicode
 wrote:
> FYI, the UTC retracted the following.
>
> [151-C19] Consensus: Modify the section on "Best Practices for Using FFFD"
> in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168,
> for Unicode version 11.0.

Thank you!

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-29 Thread Henri Sivonen via Unicode
On Sat Jun 3 23:09:01 CDT 2017, Markus Scherer wrote:
> I suggest you submit a write-up via http://www.unicode.org/reporting.html
>
> and make the case there that you think the UTC should retract
>
> http://www.unicode.org/L2/L2017/17103.htm#151-C19

The submission has been made:
http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf

> Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU
> ticket via http://bugs.icu-project.org/trac/newticket

Although they use ICU for most legacy encodings, they don't use ICU
for UTF-8. Hence, the difference between Chrome and ICU in the above
write-up.

> and make the case there, too, that you think (assuming you do) that ICU
> should change its handling of illegal UTF-8 sequences.

Whether I think ICU should change isn't quite that simple.

On one hand, a key worry that I have about Unicode changing the
long-standing guidance for UTF-8 error handling is that inducing
implementations to change (either by the developers feeling that they
have to implement the "best practice" or by others complaining when
"best practice" isn't implemented) is wasteful and a potential source
of bugs. In that sense, I feel I shouldn't ask ICU to change, either.

On the other hand, I care about implementations of the WHATWG Encoding
Standard being compliant and it appears that Node.js is on track to
exposing ICU's UTF-8 decoder via the WHATWG TextDecoder API:
https://github.com/nodejs/node/pull/13644 . Additionally, this episode
of ICU behavior getting cited in a proposal to change the guidance in
the Unicode Standard is a reason why I'd be happier if ICU followed
the Unicode 10-and-earlier / WHATWG behavior, since there wouldn't be
the risk of ICU's behavior getting cited as a different reference as
happened with the proposal to change the guidance for Unicode 11.

Still, since I'm not affiliated with the Node.js implementation, I'm a
bit worried that if I filed an ICU bug on Node's behalf, I'd be
engaging in the kind of behavior towards ICU that I don't want to see
towards other implementations, including the one I've written, in
response to the new pending Unicode 11 guidance (which I'm requesting
be retracted), so at this time I haven't filed an ICU bug on Node's
behalf and have instead mentioned the difference between ICU and the
WHATWG spec when my input on testing the Node TextDecoder
implementation was sought
(https://github.com/nodejs/node/issues/13646#issuecomment-308084459).

>> But the matter at hand is decoding potentially-invalid UTF-8 input
>> into a valid in-memory Unicode representation, so later processing is
>> somewhat a red herring as being out of scope for this step. I do agree
>> that if you already know that the data is valid UTF-8, it makes sense
>> to work from the bit pattern definition only.
>
> No, it's not a red herring. Not every piece of software has a neat "inside"
> with all valid text, and with a controllable surface to the "outside".

Fair enough. However, I don't think this supports adopting the ICU
behavior as "best practice" when looking at a prominent real-world
example of such a system.

The Go programming language is an example of a system that post-dates
UTF-8, is even designed by the same people as UTF-8 and where strings
in memory are potentially-invalid UTF-8, i.e. there isn't a clear
distinction with UTF-8 on the outside and UTF-8 on the inside. (In
contrast to e.g. Rust where the type system maintains a clear
distinction between byte buffers and strings, and strings are
guaranteed-valid UTF-8.)

Go bakes UTF-8 error handling in the language spec by specifying
per-code point iteration over potentially-invalid in-memory UTF-8
buffers. See item 2 in the list at
https://golang.org/ref/spec#For_range .

The behavior baked into the language is one REPLACEMENT CHARACTER per
bogus byte, which is neither the Unicode 10-and-earlier "best
practice" nor the ICU behavior. However, it is closer to the Unicode
10-and-earlier "best practice" than to the ICU behavior. (It differs
from the Unicode 10-and-earlier behavior only for truncated sequences
that form a prefix of a valid sequence.)

(To be clear, I'm not saying that the guidance in the Unicode Standard
should be changed to match Go, either. I'm just saying that Go is an
example of a prominent system with ambiguous inside and outside for
UTF-8 and it exhibits behavior closer to Unicode 10 than to ICU and,
therefore, is not a data point in favor of adopting the ICU behavior.)
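
For a concrete comparison, Rust's standard library follows the Unicode
10-and-earlier practice, so the difference for a truncated sequence can
be seen as follows (sketch):

fn main() {
    // A truncated sequence: the first two bytes of U+20AC (E2 82 AC)
    // with the final byte missing.
    let bytes = b"a\xE2\x82";

    // Maximal-subpart practice (Unicode 10 and earlier, WHATWG): the
    // truncated prefix becomes a single U+FFFD.
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "a\u{FFFD}");

    // Go's for-range iteration, as described above, would instead yield
    // one U+FFFD per bogus byte here: "a" followed by two REPLACEMENT
    // CHARACTERs.
    println!("{:?}", lossy);
}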

-- 
Henri Sivonen
hsivo...@mozilla.com


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Henri Sivonen via Unicode
On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode
<unicode@unicode.org> wrote:
> On Wed, 31 May 2017 15:12:12 +0300
> Henri Sivonen via Unicode <unicode@unicode.org> wrote:
>> I am not claiming it's too difficult to implement. I think it
>> inappropriate to ask implementations, even from-scratch ones, to take
>> on added complexity in error handling on mere aesthetic grounds. Also,
>> I think it's inappropriate to induce implementations already written
>> according to the previous guidance to change (and risk bugs) or to
>> make the developers who followed the previous guidance with precision
>> be the ones who need to explain why they aren't following the new
>> guidance.
>
> How straightforward is the FSM for back-stepping?

This seems beside the point, since the new guidance wasn't advertised
as improving backward stepping compared to the old guidance.

(On the first look, I don't see the new guidance improving back
stepping. In fact, if the UTC meant to adopt ICU's behavior for
obsolete five and six-byte bit patterns, AFAICT, backstepping with the
ICU behavior requires examining more bytes backward than the old
guidance required.)

>> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
>> <unicode@unicode.org> wrote:
>> > The UTF-8 conversion code that I wrote for ICU, and apparently the
>> > code that various other people have written, collects sequences
>> > starting from lead bytes, according to the original spec, and at
>> > the end looks at whether the assembled code point is too low for
>> > the lead byte, or is a surrogate, or is above 10. Stopping at a
>> > non-trail byte is quite natural, and reading the PRI text
>> > accordingly is quite natural too.
>>
>> I don't doubt that other people have written code with the same
>> concept as ICU, but as far as non-shortest form handling goes in the
>> implementations I tested (see URL at the start of this email) ICU is
>> the lone outlier.
>
> You should have researched implementations as they were in 2007.

I don't see how the state of things in 2007 is relevant to a decision
taken in 2017. It's relevant that by 2017, prominent implementations
had adopted the old Unicode guidance, and, that being the case, it's
inappropriate to change the guidance for aesthetic reasons or to favor
the Unicode Consortium-hosted implementation.

On Wed, May 31, 2017 at 8:43 PM, Shawn Steele via Unicode
<unicode@unicode.org> wrote:
> I do not understand the energy being invested in a case that shouldn't 
> happen, especially in a case that is a subset of all the other bad cases that 
> could happen.

I'm a browser developer. I've explained previously on this list and in
my blog post why the browser developer / Web standard culture favors
well-defined behavior in error cases these days.

On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode
<unicode@unicode.org> wrote:
> Henri Sivonen wrote:
>
>> If anything, I hope this thread results in the establishment of a
>> requirement for proposals to come with proper research about what
>> multiple prominent implementations to about the subject matter of a
>> proposal concerning changes to text about implementation behavior.
>
> Considering that several folks have objected that the U+FFFD
> recommendation is perceived as having the weight of a requirement, I
> think adding Henri's good advice above as a "requirement" seems
> heavy-handed. Who will judge how much research qualifies as "proper"?

In the Unicode scope, it's indeed harder to draw clear line to decide
what the prominent implementations are than in the WHATWG scope. The
point is that just checking ICU is not good enough. Someone making a
proposal should check the four major browser engines and a bunch of
system frameworks and standard libraries for well-known programming
languages. Which frameworks and standard libraries and how many is not
precisely definable objectively and depends on the subject matter
(there are many UTF-8 decoders but e.g. fewer text shaping engines).
There will be diminishing returns to checking them. Chances are that
it's not necessary to check too many for a pattern to emerge to judge
whether the existing spec language is being implemented (don't change
it) or being ignored (probably should be changed then).

In any case, "we can't check everything or choose fairly what exactly
to check" shouldn't be a reason for it to be OK to just check ICU or
to make abstract arguments without checking any implementations at
all. Checking multiple popular implementations is homework better done
than just checking ICU even if it's up to the person making the
proposal to choose which implementations to check exactly. The
committee should be abl

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode
I've researched this more. While the old advice dominates the handling
of non-shortest forms, there is more variation than I previously
thought when it comes to truncated sequences and CESU-8-style
surrogates. Still, the ICU behavior is an outlier considering the set
of implementations that I tested.

I've written up my findings at https://hsivonen.fi/broken-utf-8/

The write-up mentions
https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
like to draw everyone's attention to that bug, which is real-world
evidence of a bug arising from two UTF-8 decoders within one product
handling UTF-8 errors differently.

On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode
 wrote:
> There is plenty of time for public comment, since it was targeted at Unicode
> 11, the release for about a year from now, not Unicode 10, due this year.
> When the UTC "approves a change", that change is subject to comment, and the
> UTC can always reverse or modify its approval up until the meeting before
> release date. So there are ca. 9 months in which to comment.

What should I read to learn how to formulate an appeal correctly?

Does it matter if a proposal/appeal is submitted as a non-member
implementor person, as an individual person member or as a liaison
member? http://www.unicode.org/consortium/liaison-members.html list
"the Mozilla Project" as a liaison member, but Mozilla-side
conventions make submitting proposals like this "as Mozilla"
problematic (we tend to avoid "as Mozilla" statements on technical
standardization fora except when the W3C Process forces us to make
them as part of charter or Proposed Recommendation review).

> The modified text is a set of guidelines, not requirements. So no
> conformance clause is being changed.

I'm aware of this.

> If people really believed that the guidelines in that section should have
> been conformance clauses, they should have proposed that at some point.

It seems to me that this thread does not support the conclusion that
the Unicode Standard's expression of preference for the number of
REPLACEMENT CHARACTERs should be made into a conformance requirement
in the Unicode Standard. This thread could be taken to support a
conclusion that the Unicode Standard should not express any preference
beyond "at least one and at most as many as there were bytes".

On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
 wrote:
>  In any case, Henri is complaining that it’s too difficult to implement; it 
> isn’t.  You need two extra states, both of which are trivial.

I am not claiming it's too difficult to implement. I think it
inappropriate to ask implementations, even from-scratch ones, to take
on added complexity in error handling on mere aesthetic grounds. Also,
I think it's inappropriate to induce implementations already written
according to the previous guidance to change (and risk bugs) or to
make the developers who followed the previous guidance with precision
be the ones who need to explain why they aren't following the new
guidance.

On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
 wrote:
> The UTF-8 conversion code that I wrote for ICU, and apparently the code that
> various other people have written, collects sequences starting from lead
> bytes, according to the original spec, and at the end looks at whether the
> assembled code point is too low for the lead byte, or is a surrogate, or is
> above 10. Stopping at a non-trail byte is quite natural, and reading the
> PRI text accordingly is quite natural too.

I don't doubt that other people have written code with the same
concept as ICU, but as far as non-shortest form handling goes in the
implementations I tested (see URL at the start of this email) ICU is
the lone outlier.

> Aside from UTF-8 history, there is a reason for preferring a more
> "structural" definition for UTF-8 over one purely along valid sequences.
> This applies to code that *works* on UTF-8 strings rather than just
> converting them. For UTF-8 *processing* you need to be able to iterate both
> forward and backward, and sometimes you need not collect code points while
> skipping over n units in either direction -- but your iteration needs to be
> consistent in all cases. This is easier to implement (especially in fast,
> short, inline code) if you have to look only at how many trail bytes follow
> a lead byte, without having to look whether the first trail byte is in a
> certain range for some specific lead bytes.

But the matter at hand is decoding potentially-invalid UTF-8 input
into a valid in-memory Unicode representation, so later processing is
somewhat a red herring as being out of scope for this step. I do agree
that if you already know that the data is valid UTF-8, it makes sense
to work from the bit pattern definition only. (E.g. in encoding_rs,
the implementation I've written and that's on track to replacing uconv
in Firefox, UTF-8 decode works 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Henri Sivonen via Unicode
On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode
 wrote:
> On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote:
>
> There's some sort of rule that proposals should be made seven days in
> advance of the meeting.  I can't find it now, so I'm not sure whether
> the actual rule was followed, let alone what authority it has.
>
> Ideally, proposals that update algorithms or properties of some significance
> should be required to be reviewed in more than one pass. The procedures of
> the UTC are a bit weak in that respect, at least compared to other standards
> organizations. The PRI process addresses that issue to some extent.

What action should I take to make proposals to be considered by the UTC?

I'd like to make two:

 1) Substantive: Reverse the decision to modify U+FFFD best practice
when decoding UTF-8. (I think the decision lacked a truly compelling
reason to change something that has a number of prominent
implementations and the decision complicates U+FFFD generation when
validating UTF-8 by state machine. Aesthetic considerations in error
handling shouldn't outweigh multiple prominent implementations and
shouldn't introduce implementation complexity.)

 2) Procedural: To be considered in the future, proposals to change
what the standard suggests or requires implementations to do should
consider different implementation strategies and discuss the impact of
the change in the light of the different implementation strategies (in
the matter at hand, I think the proposal should have included a
discussion of the impact on UTF-8 validation state machines) and
should include a review of what prominent implementations, including
major browser engines, operating system libraries, and standard
libraries of well-known programming languages, already do. (The more
established the presently specced behavior is among prominent
implementations, the more compelling reason should be required to
change the spec. An implementation hosted by the Consortium itself
shouldn't have special weight compared to other prominent
implementations.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:36 PM, Markus Scherer  wrote:
> Let me try to address some of the issues raised here.

Thank you.

> The proposal changes a recommendation, not a requirement.

This is a very bad reason in favor of the change. If anything, this
should be a reason why there is no need to change the spec text.

> Conformance
> applies to finding and interpreting valid sequences properly. This includes
> not consuming parts of valid sequences when dealing with illegal ones, as
> explained in the section "Constraints on Conversion Processes".
>
> Otherwise, what you do with illegal sequences is a matter of what you think
> makes sense -- a matter of opinion and convenience. Nothing more.

This may be the Unicode-level view of error handling. It isn't the
Web-level view of error handling. In the world of Web standards (i.e.
standards that read on the behavior of browser engines), we've
learned that implementation-defined behavior is bad, because someone
makes a popular site that depends on the implementation-defined
behavior of the browser they happened to test in. For this reason, the
WHATWG has since 2004 written specs that are well-defined even in
corner cases and for non-conforming input, and we've tried to extend
this culture into the W3C, too. (Sometimes, exceptions are made when
there's a very good reason to handle a corner case differently in a
given implementation: A recent example is CSS allowing the
non-preservation of lone surrogates entering the CSS Object Model via
JavaScript strings in order to enable CSS Object Model implementations
that use UTF-8 [really UTF-8 and not some almost-UTF-8 variant]
internally. But, yes, we really do sweat the details on that level.)

Even if one could argue that implementation-defined behavior on the
topic of number of U+FFFDs for ill-formed sequences in UTF-8 decode
doesn't matter, the WHATWG way of doing things isn't to debate whether
implementation-defined behavior matters in this particular case but to
require one particular behavior in order to have well-defined behavior
even when input is non-conforming.

It further seems that there are people who do care about what's a
*requirement* on the WHATWG level matching what's "best practice" on
the Unicode level:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=19938

Now that major browsers agree, knowing what I know about how the
WHATWG operates, while I can't speak for Anne, I expect the WHATWG
spec to stay as-is, because it now matches the browser consensus.

So as a practical matter, if Unicode now changes its "best practice",
when people check consistency with Unicode-level "best practice" and
notice a discrepancy, the WHATWG and developers of implementations
that took the previously-stated "best practice" seriously (either
directly or by the means of another spec, like the WHATWG Encoding
Standard, elevating it to a *requirement*) will need to explain why
they don't follow the best practice.

It is really inappropriate to inflict that trouble onto pretty much
everyone except ICU when the rationale for the change is as flimsy as
"feels right". And, as noted earlier, politically it looks *really
bad* for Unicode to change its own previous recommendation in order to
side with ICU, which doesn't follow it, when a number of other
prominent implementations do.

> I believe that the discussion of how to handle illegal sequences came out of
> security issues a few years ago from some implementations including valid
> single and lead bytes with preceding illegal sequences.
...
> Why do we care how we carve up an illegal sequence into subsequences? Only
> for debugging and visual inspection.
...
> If you don't like some recommendation, then do something else. It does not
> matter. If you don't reject the whole input but instead choose to replace
> illegal sequences with something, then make sure the something is not
> nothing -- replacing with an empty string can cause security issues.
> Otherwise, what the something is, or how many of them you put in, is not
> very relevant. One or more U+FFFDs is customary.

Given that the recommendation came about for security reasons, it's a
really bad idea to suggest that implementors should decide on their own
what to do and trust that their decision deviates little enough from
the suggestion to stay on the secure side. To be clear, I'm not, at
this time, claiming that the number of U+FFFDs has a security
consequence as long as the number is at least one, but there's an
awfully short slippery slope to giving the caller of a converter API
the option to "ignore errors", i.e. make the number zero, which *is*,
as you note, a security problem.
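
For illustration, a hypothetical Rust sketch of the kind of thing that
sits at the bottom of that slope (the 0xFF byte and the tag filter are
just stand-ins, not any real converter API): a check that runs over the
raw bytes never sees the tag, a decoder in "ignore errors" mode silently
reassembles it, and a decoder that substitutes U+FFFD keeps the seam
visible downstream.

fn main() {
    let input: &[u8] = b"<scr\xFFipt>alert(1)</script>";

    // A pre-decode filter scanning the raw bytes sees no "<script".
    assert!(!input.windows(7).any(|w| w == &b"<script"[..]));

    // "Ignore errors" (zero U+FFFDs): dropping the illegal byte
    // reassembles the tag behind the filter's back.
    let ignored: String = input
        .iter()
        .filter(|&&b| b != 0xFF) // stand-in for "skip illegal bytes"
        .map(|&b| b as char)
        .collect();
    assert!(ignored.starts_with("<script>"));

    // Replacement keeps the damage visible downstream.
    assert_eq!(
        String::from_utf8_lossy(input),
        "<scr\u{FFFD}ipt>alert(1)</script>"
    );
}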

> When the current recommendation came in, I thought it was reasonable but
> didn't like the edge cases. At the time, I didn't think it was important to
> twiddle with the text in the standard, and I didn't care that ICU didn't
> exactly implement that particular recommendation.

If ICU doesn't care, then it should be ICU developers and 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton
<alast...@alastairs-place.net> wrote:
> On 16 May 2017, at 09:31, Henri Sivonen via Unicode <unicode@unicode.org> 
> wrote:
>>
>> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>> <alast...@alastairs-place.net> wrote:
>>> That would be true if the in-memory representation had any effect on what 
>>> we’re talking about, but it really doesn’t.
>>
>> If the internal representation is UTF-16 (or UTF-32), it is a likely
>> design that there is a variable into which the scalar value of the
>> current code point is accumulated during UTF-8 decoding.
>
> That’s quite a likely design with a UTF-8 internal representation too; it’s 
> just that you’d only decode during processing, as opposed to immediately at 
> input.

The time to generate the U+FFFDs is at input time, which is what's at
issue here. The later processing, which may involve iterating by code
point and computing scalar values, is a different step that should be
able to assume valid UTF-8 and not be concerned with invalid UTF-8.
(How confidently different programming languages and frameworks let you
maintain the invariant that, after input, all in-RAM UTF-8 can be
treated as valid varies.)
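
To sketch the division of labor I have in mind, in Rust, using the
standard library's from_utf8_lossy as a stand-in for whatever
decode-with-replacement runs at the input boundary (the function names
are made up for illustration; this isn't encoding_rs):

use std::borrow::Cow;

// Replacement happens once, at the input boundary; the returned
// Cow<str> is guaranteed-valid UTF-8.
fn read_input(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

// Later processing may iterate by code point and compute scalar
// values, but it never needs an error path for ill-formed UTF-8:
// the &str type already guarantees validity.
fn later_processing(s: &str) {
    for c in s.chars() {
        let _scalar: u32 = c as u32;
    }
}

fn main() {
    later_processing(&read_input(b"Hello, \xE4\xB8\x96\xE7\x95\x8C!"));
}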

>> When the internal representation is UTF-8, only UTF-8 validation is
>> needed, and it's natural to have a fail-fast validator, which *doesn't
>> necessarily need such a scalar value accumulator at all*.
>
> Sure.  But a state machine can still contain appropriate error states without 
> needing an accumulator.

As I said upthread, it could, but it seems inappropriate to ask
implementations to take on that extra complexity on grounds as weak as
"ICU does it" or "feels right" when the current recommendation doesn't
call for those extra states and the current spec is consistent with a
number of prominent non-ICU implementations, including Web browsers.

>>> In what sense is this “interop”?
>>
>> In the sense that prominent independent implementations do the same
>> externally observable thing.
>
> The argument is, I think, that in this case the thing they are doing is the 
> *wrong* thing.

It seems weird to characterize following the currently-specced "best
practice" as "wrong" without showing a compelling fundamental flaw
(such as a genuine security problem) in the currently-specced "best
practice". With implementations of the currently-specced "best
practice" already shipped, I don't think aesthetic preferences should
be considered enough of a reason to proclaim behavior adhering to the
currently-specced "best practice" as "wrong".

>  That many of them do it would only be an argument if there was some reason 
> that it was desirable that they did it.  There doesn’t appear to be such a 
> reason, unless you can think of something that hasn’t been mentioned thus far?

I've already given a reason: UTF-8 validation code not needing to have
extra states catering to aesthetic considerations of U+FFFD
consolidation.

>  The only reason you’ve given, to date, is that they currently do that, so 
> that should be the recommended behaviour (which is little different from the 
> argument - which nobody deployed - that ICU currently does the other thing, 
> so *that* should be the recommended behaviour; the only difference is that 
> *you* care about browsers and don’t care about ICU, whereas you yourself 
> suggested that some of us might be advocating this decision because we care 
> about ICU and not about e.g. browsers).

Not just browsers. Also OpenJDK and Python 3. Do I really need to test
the standard libraries of more languages/systems to more strongly make
the case that the ICU behavior (according to the proposal PDF) is not
the norm and what the spec currently says is?

> I’ll add also that even among the implementations you cite, some of them 
> permit surrogates in their UTF-8 input (i.e. they’re actually processing 
> CESU-8, not UTF-8 anyway).  Python, for example, certainly accepts the 
> sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” 
> implementation that conformed literally to the recommendation, as you seem to 
> want, should instead replace it with *four* U+FFFDs (I think), no?

I see that behavior in Python 2. Earlier, I said that Python 3 agrees
with the current spec for my test case. The Python 2 behavior I see is
not just against "best practice" but obviously non-compliant.

(For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.)

> One additional note: the standard codifies this behaviour as a 
> *recommendation*, not a requirement.

This is an odd argument in favor of changing it. If the argument is
that it's just a recommendation that you don't need to adhere to,
surely then the people who don't like the current recommendation
should choose not to adhere to it instead of advocating changing it.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag  wrote:
> but I think the way he raises this point is needlessly antagonistic.

I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
 wrote:
> That would be true if the in-memory representation had any effect on what 
> we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely
design that there is a variable into which the scalar value of the
current code point is accumulated during UTF-8 decoding. In such a
scenario, it can be argued as "natural" to first operate according to
the general structure of UTF-8 and then inspect what you got in the
accumulation variable (ruling out non-shortest forms, values above the
Unicode range and surrogate values after the fact).

When the internal representation is UTF-8, only UTF-8 validation is
needed, and it's natural to have a fail-fast validator, which *doesn't
necessarily need such a scalar value accumulator at all*. The
construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ when
used as a UTF-8 validator is the best illustration of a UTF-8
validator not necessarily looking like a "natural" UTF-8 to UTF-16
converter at all.
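
To make the shape of such a validator concrete, here's a rough Rust
sketch built on the standard library's validity check (the function
name is made up; this is an illustration of the structure, not
encoding_rs): it emits one U+FFFD per rejected subsequence, resumes
validation right after it, and never accumulates a scalar value for the
bytes it rejects.

fn decode_with_replacement(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                // The remaining input is entirely valid.
                out.push_str(valid);
                return out;
            }
            Err(e) => {
                let good = e.valid_up_to();
                // The first `good` bytes were just validated.
                out.push_str(std::str::from_utf8(&bytes[..good]).unwrap());
                out.push('\u{FFFD}');
                match e.error_len() {
                    // An invalid subsequence in the middle: skip it
                    // and resume validating right after it.
                    Some(len) => bytes = &bytes[good + len..],
                    // A truncated sequence at the end of the input:
                    // one U+FFFD for the tail, then stop.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    // 0xFF is never valid in UTF-8; each stray byte gets its own U+FFFD.
    assert_eq!(decode_with_replacement(b"a\xFF\xFFb"), "a\u{FFFD}\u{FFFD}b");
}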

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right”.
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same
externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
with the current spec should show why it's important to have a
different number of U+FFFDs than the spec's "best practice" calls for
now.

>  If you’re about to mutter something about security, consider this: security 
> code *should* refuse to compare strings that contain U+FFFD (or at least 
> should never treat them as equal, even to themselves), because it has no way 
> to know what that code point represents.

In practice, e.g. the Web Platform doesn't allow processing to stop
when input contains a U+FFFD, so the focus is mainly on making sure
that U+FFFDs are placed well enough to prevent bad stuff under normal
operations. At least typically, the number of U+FFFDs doesn't matter
for that purpose, but when browsers agree on the number of U+FFFDs,
changing that number should have an overwhelmingly strong rationale. A
security reason could be a strong reason, but such a security
motivation for fewer U+FFFDs has not been shown, to my knowledge.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD   (2)

I advocate (1), most simply because that's what Firefox, Edge and
Chrome do *in accordance with the currently-recommended best practice*
and, less simply, because it makes sense in the presence of a
fail-fast UTF-8 validator. I think the burden of proof to show an
overwhelmingly good reason to change should, at this point, be on
whoever proposes doing it differently than what the current
widely-implemented spec says.

> It’s pretty clear what the intent of the encoder was there, I’d say, and 
> while we certainly don’t want to decode it as a NUL (that was the source of 
> previous security bugs, as I recall), I also don’t see the logic in insisting 
> that it must be decoded to *three* code points when it clearly only 
> represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever
a fail-fast validator fails.
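
Concretely, assuming a decoder structured like the validation loop
sketched earlier in this mail (Rust's String::from_utf8_lossy appears
to behave the same way), e0 80 80 comes out as three REPLACEMENT
CHARACTERs: E0 is rejected as soon as 80 (outside E0's valid A0..BF
continuation range) is seen, and each remaining 80 is rejected on its
own.

fn main() {
    let bytes: &[u8] = &[0xE0, 0x80, 0x80];
    // Matches what Firefox, Edge and Chrome do for this input, per the
    // earlier part of this mail.
    assert_eq!(String::from_utf8_lossy(bytes), "\u{FFFD}\u{FFFD}\u{FFFD}");
}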

> This isn’t just a matter of “feels nicer”.  (1) is simply illogical 
> behaviour, and since behaviours (1) and (2) are both clearly out there today, 
> it makes sense to pick the more logical alternative as the official 
> recommendation.

Again, the current best practice makes perfect logical sense in the
context of a fail-fast UTF-8 validator. Moreover, it doesn't look like
both are "out there" equally when major browsers, OpenJDK and Python 3
agree. (I expect I could find more prominent implementations that
implement the currently-stated best practice, but I feel I shouldn't
have to.) From my experience from working on Web standards and
implementing them, I think it's 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen  wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".

Testing with that file, Python 3 and OpenJDK 8 agree with the
currently-specced best-practice, too. I expect there to be other
well-known implementations that comply with the currently-specced best
practice, so the rationale to change the stated best practice would
have to be very strong (as in: security problem with currently-stated
best practice) for a change to be appropriate.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson
<pub...@khwilliamson.com> wrote:
> On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>>
>> The proposal is to make ICU's spec violation conforming. I think there
>> is both a technical and a political reason why the proposal is a bad
>> idea.
>
>
>
> Henri's claim that "The proposal is to make ICU's spec violation conforming"
> is a false statement, and hence all further commentary based on this false
> premise is irrelevant.
>
> I believe that ICU is actually currently conforming to TUS.

Do you mean that ICU's behavior differs from what the PDF claims (I
didn't test and took the assertion in the PDF about behavior at face
value) or do you mean that despite deviating from the
currently-recommended best practice the behavior is conforming,
because the relevant part of the spec is mere best practice and not a
requirement?

> TUS has certain requirements for UTF-8 handling, and it has certain other
> "Best Practices" as detailed in 3.9.  The proposal involves changing those
> recommendations.  It does not involve changing any requirements.

Even so, I think even changing a recommendation of "best practice"
needs way better rationale than "feels right" or "ICU already does it"
when a) major browsers (which operate in the most prominent
environment of broken and hostile UTF-8) agree with the
currently-recommended best practice and b) the currently-recommended
best practice makes more sense for implementations where "UTF-8
decoding" is actually mere "UTF-8 validation".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
 wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, UTF-8 as the internal memory representation
is *such a good design* (when legacy constraints permit) that, *despite
it not being the current dominant design*, I think the Unicode
Consortium should fully support UTF-8 as the internal memory
representation and not treat UTF-16 as the one true internal
representation that gets considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory
representation (for the purposes of this thread) but trying to
motivate why the Consortium should consider "UTF-8 internally" equally
despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally"
perspective, but one way clearly makes more sense from the "UTF-8
internally" perspective, the "UTF-8 internally" perspective should be
decisive in *such a case*. (I think the matter at hand is such a
case.)

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
<alast...@alastairs-place.net> wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode@unicode.org> 
> wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
> multiple errors there makes no sense.

The currently-specced behavior makes perfect sense when you add error
emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementation very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case in any way to
go away, I think the Unicode Consortium should recognize the serious
technical merit of the "UTF-8 as the in-memory representation" case as
having significant enough merit that proposals like this should
consider impact on both cases equally, despite the "UTF-8 as the
in-memory representation" case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.

First, the technical reason:

ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes the concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake (since Unicode grew past 16 bits
anyway making UTF-16 both variable-width *and*
ASCII-incompatible--i.e. widening the code units to be
ASCII-incompatible didn't buy a constant-width encoding after all) and
that when the legacy constraints of Win32, Java, C#, JavaScript, ICU,
etc. don't force UTF-16 as the internal Unicode representation, using
UTF-8 as the internal Unicode representation is the technically
superior design: Using UTF-8 as the internal Unicode representation is
memory-efficient and cache-efficient when dealing with data formats
whose syntax is mostly ASCII (e.g. HTML), forces developers to handle
variable-width issues right away, makes input decode a matter of mere
validation without copy when the input is conforming and makes output
encode infinitely fast (no encode step needed).
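
To illustrate those last two points with a minimal Rust sketch
(standard library only; the function names are made up, and this is not
a description of any particular converter API):

use std::borrow::Cow;

// With UTF-8 as the in-memory representation, "decoding" conforming
// input is mere validation: the common case borrows the input bytes
// without copying, and only ill-formed input forces an owned,
// U+FFFD-substituted copy.
fn decode(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

// "Encoding" to UTF-8 output needs no encode step at all: the bytes
// are already in the output representation.
fn encode(s: &str) -> &[u8] {
    s.as_bytes()
}

fn main() {
    let input: &[u8] = b"<p>mostly ASCII markup</p>";
    // Conforming input: decode is a borrow, not a copy.
    assert!(matches!(decode(input), Cow::Borrowed(_)));
    // And there is no transcoding step on the way out either.
    assert_eq!(encode(&decode(input)), input);
}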

Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.

When looking this issue from the ICU perspective of using UTF-16 as
the in-memory representation of Unicode, it's easy to consider the
proposed change as the easier thing for implementation (after all, no
change for the ICU implementation is involved!). However, when UTF-8
is the in-memory representation of Unicode and "decoding" UTF-8 input
is a matter of *validating* UTF-8, a state machine that rejects a
sequence as soon as it's impossible for the sequence to be valid UTF-8
(under the definition that excludes surrogate code points and code
points beyond U+10FFFF) makes a whole lot of sense. If the proposed
change was adopted, while Draconian decoders (that fail upon first
error) could retain their current state machine, implementations that
emit U+FFFD for errors and continue would have to add more state
machine states (i.e. more complexity) to consolidate more input bytes
into a single U+FFFD even after a valid sequence is obviously
impossible.

When the decision can easily go either way for implementations that
use UTF-16 internally but the options are not equal when using UTF-8
internally, the "UTF-8 internally" case should be decisive.
(Especially when spec-wise that decision involves no change. I further
note the proposal PDF argues on the level of "feels right" without
even discussing the impact on implementations that use UTF-8
internally.)

As a matter of implementation experience, the implementation I've
written (https://github.com/hsivonen/encoding_rs) supports both UTF-16
and UTF-8 as the in-memory Unicode representation, and the fail-fast
requirement wasn't onerous in the UTF-16 scenario.

Second, the political reason:

Now that ICU is a Unicode Consortium project, I think the Unicode
Consortium should be particularly sensitive to biases arising from
being both the source of the spec and the source of a popular
implementation. If the way the Consortium resolves a discrepancy
between ICU behavior and a well-known spec provision (this isn't some
little-known corner case, after all) is to change the spec instead of
changing ICU, it looks *really bad* both for the equal footing of ICU
vs. other implementations in how the standard is developed and for the
reliability of the standard text, rather than ICU source code, as the
source of truth that other implementors need to pay attention to. That
is *especially* so when the change is not neutral for implementations
that have made different but completely valid (per the then-existing
spec) and, in the absence of legacy constraints, superior architectural
choices compared to ICU (i.e. UTF-8 internally instead
of UTF-16 internally).

I can see the irony of this viewpoint coming from a WHATWG-aligned
browser developer, but I note that even browsers that use ICU for
legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
isn't, in fact, the dominant browser UTF-8 behavior. That is, even
Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
environment that's the most sensitive to how issues like