subject:"Re\: Feedback on the proposal to change U\+FFFD generation when decoding ill\-formed UTF\-8"

> And *that* is what the specification says.  The whole problem here is that 
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of 
> us don’t think *should*
> be considered best practice.

> Perhaps “best practice” should simply be altered to say that you *clearly 
> document* your behavior
> in the case of invalid UTF-8 sequences, and that code should not rely on the 
> number of U+FFFDs 
> generated, rather than suggesting a behaviour?

That's what I've been suggesting.

I think we could maybe go a little further though:

* Best practice is clearly not to depend on the # of U+FFFDs generated by 
another component/app.  Clearly that can't be relied upon, so I think everyone 
can agree with that.
* I think encouraging documentation of behavior is cool, though there are 
probably low priority bugs and people don't like to read the docs in that 
detail, so I wouldn't expect very much from that.
* As far as I can tell, there are two (maybe three) sane approaches to this 
problem:
* Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence 
* Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.
* And (maybe, I haven't heard folks arguing for this one) emit one 
U+FFFD at the first garbage byte and then ignore the input until valid data 
starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group).
* I'd be happy if the best practice encouraged one of those two (or maybe 
three) approaches.  I think an approach that called rand() to see how many 
U+FFFDs to emit when it encountered bad data is fair to discourage.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Doug Ewell via Unicode

Henri Sivonen wrote:

> If anything, I hope this thread results in the establishment of a
> requirement for proposals to come with proper research about what
> multiple prominent implementations to about the subject matter of a
> proposal concerning changes to text about implementation behavior.

Considering that several folks have objected that the U+FFFD
recommendation is perceived as having the weight of a requirement, I
think adding Henri's good advice above as a "requirement" seems
heavy-handed. Who will judge how much research qualifies as "proper"?
Who will determine that the judge doesn't have a conflict?

An alternative would be to require that proposals, once received with
whatever amount of research, are augmented with any necessary additional
research *before* being approved. The identity or reputation of the
requester should be irrelevant to approval.

--
Doug Ewell | Thornton, CO, US | ewellic.org

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

> it’s more meaningful for whoever sees the output to see a single U+FFFD 
> representing 
> the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid 
> lead byte and 
> then another for an “unexpected” trailing byte.

I disagree.  It may be more meaningful for some applications to have a single 
U+FFFD representing an illegally encoded 2-byte NULL than to have 2 U+FFFDs.  
Of course then you don't know if it was an illegally encoded 2-byte NULL or an 
illegally encoded 3-byte NULL or whatever, so some information that other 
applications may be interested in is lost.

Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the 
byte, and try again" approach.  

-Shawn

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

> For implementations that emit FFFD while handling text conversion and repair 
> (ie, converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same 
> results, so that indices within the
> resulting strings are consistent across implementations for all the correct 
> characters thereafter.

That seems optimistic :)

If interoperability is the goal, then it would seem to me that changing the 
recommendation would be contrary to that goal.  There are systems that will not 
or cannot change to a new recommendation.  If such systems are updated, then 
adoption of those systems will likely take some time.

In other words, I cannot see where “consistency across implementations” would 
be achievable anytime in the near future.

It seems to me that being able to use a data stream of ambiguous quality in 
another application with predictable results, then that stream should be 
“repaired” prior to being handed over.  Then both endpoints would be using the 
same set of FFFDs, whether that was single or multiple forms.


-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Mark Davis ☕️ via Unicode

> I do not understand the energy being invested in a case that shouldn't
happen, especially in a case that is a subset of all the other bad cases
that could happen.

I think Richard stated the most compelling reason:

… The bug you mentioned arose from two different ways of
counting the string length in 'characters'.  Having two different
'character' counts for the same string is inviting trouble.


For implementations that emit FFFD while handling text conversion and
repair (ie, converting ill-formed UTF-8 to well-formed), it is best for
interoperability if they get the same results, so that indices within the
resulting strings are consistent across implementations for all the
*correct* characters thereafter.

It would be preferable *not* to have the following:

source = %c0%80abc

Vendor 1:
fixed = fix(source)
fixed == �abc
codepointAt(fixed, 3) == 'b'

Vendor2:
fixed = fix(source)
fixed == ��abc
codepointAt(fixed, 3) =
=
'
c
'

In theory one could just throw an exception. In practice, nobody wants
their browser

to belly up on a webpage with a component that has an ill-formed bit of
UTF-8.

I
n theory one could document everyone's flavor of the month for how many
FFFD's to emit. In practice, that falls apart immediately, since in today's
interconnected world you can't tell which processes get first crack at text
repair.

Mark

On Wed, May 31, 2017 at 7:43 PM, Shawn Steele via Unicode <
unicode@unicode.org> wrote:

> > > In either case, the bad characters are garbage, so neither approach is
> > > "better" - except that one or the other may be more conducive to the
> > > requirements of the particular API/application.
>
> > There's a potential issue with input methods that indirectly edit the
> backing store.  For example,
> > GTK input methods (e.g. function gtk_im_context_delete_surrounding())
> can delete an amount
> > of text specified in characters, not storage units.  (Deletion by
> storage units is not available in this
> > interface.)  This might cause utter confusion or worse if the backing
> store starts out corrupt.
> > A corrupt backing store is normally manually correctable if most of the
> text is ASCII.
>
> I think that's sort of what I said: some approaches might work better for
> some systems and another approach might work better for another system.
> This also presupposes a corrupt store.
>
> It is unclear to me what the expected behavior would be for this
> corruption if, for example, there were merely a half dozen 0x80 in the
> middle of ASCII text?  Is that garbage a single "character"?  Perhaps
> because it's a consecutive string of bad bytes?  Or should it be 6
> characters since they're nonsense?  Or maybe 2 characters because the
> maximum # of trail bytes we can have is 3?
>
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?
>
> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.
>
> There also appears to be a special weight given to non-minimally-encoded
> sequences.  It would seem to me that none of these illegal sequences should
> appear in practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
> sequences, causing garbage (perhaps one of the above 2 codeing errors).
>
> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real".  And that shouldn't happen (it's a bad encoder after
> all).  The other scenarios seem just as likely, (or, IMO, much more likely)
> than a badly designed encoder creating overlong sequences that appear to
> fit the UTF-8 pattern but aren't actually UTF-8.
>
> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.
>
> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases
> that could happen.
>
> -Shawn
>
>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode

On 31 May 2017, at 18:43, Shawn Steele via Unicode  wrote:
> 
> It is unclear to me what the expected behavior would be for this corruption 
> if, for example, there were merely a half dozen 0x80 in the middle of ASCII 
> text?  Is that garbage a single "character"?  Perhaps because it's a 
> consecutive string of bad bytes?  Or should it be 6 characters since they're 
> nonsense?  Or maybe 2 characters because the maximum # of trail bytes we can 
> have is 3?

It should be six U+FFFD characters, because 0x80 is not a lead byte.  
Basically, the new proposal is that we should decode bytes that structurally 
match UTF-8, and if the encoding is then illegal (because it’s over-long, 
because it’s a surrogate or because it’s over U+10) then the entire thing 
is replaced with U+FFFD.  If, on the other hand, we get a sequence that isn’t 
structurally valid UTF-8, we replace the maximally *structurally* valid subpart 
with U+FFFD and continue.

> What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?

Then you get two U+FFFDs.

> I can see how different implementations might be able to come up with "rules" 
> that would help them navigate (or clean up) those minefields, however it is 
> not at all clear to me that there is a "best practice" for those situations.

I’m not sure the whole “best practice” thing has been a lot of help here.  
Perhaps we should change it to say “Suggested Handling”, to make quite clear 
that filing a bug report against code that chooses some other option is not 
necessary?

> There also appears to be a special weight given to non-minimally-encoded 
> sequences.

I don’t think that’s true, *although* it *is* true that UTF-8 decoders 
historically tended to allow such things, so one might assume that some 
software out there is generating them for whatever reason.

There are also *deliberate* violations of the minimal length encoding 
specification in some cases (for instance to allow NUL to be encoded in such a 
way that it won’t terminate a C-style string).  Yes, you may retort, that isn’t 
“valid UTF-8”.  Sure.  It *is* useful, though, and it is *in use*.  If a UTF-8 
decoder encounters such a thing, it’s more meaningful for whoever sees the 
output to see a single U+FFFD representing the illegally encoded NUL that it is 
to see two U+FFFDs, one for an invalid lead byte and then another for an 
“unexpected” trailing byte.

Likewise, there are encoders that generate surrogates in UTF-8, which is, of 
course, illegal, but *does* happen.  Again, they can provide reasonable 
justifications for their behaviour (typically they want the default binary sort 
to work the same as for UTF-16 for some reason), and again, replacing a single 
surrogate with U+FFFD rather than multiple U+FFFDs is more helpful to 
whoever/whatever ends up seeing it.

And, of course, there are encoders that are attempting to exploit security 
flaws, which will very definitely generate these kinds of things.

>  It would seem to me that none of these illegal sequences should appear in 
> practice, so we have either:
> 
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of sequences, 
> causing garbage (perhaps one of the above 2 codeing errors).

I see no reason to suppose that the proposed behaviour would function any less 
well in those cases.

> Only in the first case, of a bad encoder, are the overlong sequences actually 
> "real".  And that shouldn't happen (it's a bad encoder after all).

Except some encoders *deliberately* use over-longs, and one would assume that 
since UTF-8 decoders historically allowed this, there will be data “in the 
wild” that has this form.

> The other scenarios seem just as likely, (or, IMO, much more likely) than a 
> badly designed encoder creating overlong sequences that appear to fit the 
> UTF-8 pattern but aren't actually UTF-8.

I’m not sure I agree that flipped bits, lost bytes and extra bytes are more 
likely than a “bad” encoder.  Bad string manipulation is of course prevalent, 
though - there’s no way around that.

> The other cases are going to cause byte patterns that are less "obvious" 
> about how they should be navigated for various applications.

This is true, *however* the new proposed behaviour is in no way inferior to the 
old proposed behaviour in those cases - it’s just different.

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode

> On 30 May 2017, at 18:11, Shawn Steele via Unicode  
> wrote:
> 
>> Which is to completely reverse the current recommendation in Unicode 9.0. 
>> While I agree that this might help you fending off a bug report, it would 
>> create chances for bug reports for Ruby, Python3, many if not all Web 
>> browsers,...
> 
> & Windows & .Net
> 
> Changing the behavior of the Windows / .Net SDK is a non-starter.
> 
>> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows 
>> what it means, but everybody knows they don't exist.
> 
> Yes, this is trying to improve the language for a scenario that CANNOT 
> HAPPEN.  We're trying to optimize a case for data that implementations should 
> never encounter.  It is sort of exactly like optimizing for the case where 
> your data input is actually a dragon and not UTF-8 text.  
> 
> Since it is illegal, then the "at least 1 FFFD but as many as you want to 
> emit (or just fail)" is fine.

And *that* is what the specification says.  The whole problem here is that 
someone elevated one choice to the status of “best practice”, and it’s a choice 
that some of us don’t think *should* be considered best practice.

Perhaps “best practice” should simply be altered to say that you *clearly 
document* your behaviour in the case of invalid UTF-8 sequences, and that code 
should not rely on the number of U+FFFDs generated, rather than suggesting a 
behaviour?

Kind regards,

Alastair.

--
http://alastairs-place.net

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

> > In either case, the bad characters are garbage, so neither approach is 
> > "better" - except that one or the other may be more conducive to the 
> > requirements of the particular API/application.

> There's a potential issue with input methods that indirectly edit the backing 
> store.  For example,
> GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can 
> delete an amount 
> of text specified in characters, not storage units.  (Deletion by storage 
> units is not available in this
> interface.)  This might cause utter confusion or worse if the backing store 
> starts out corrupt. 
> A corrupt backing store is normally manually correctable if most of the text 
> is ASCII.

I think that's sort of what I said: some approaches might work better for some 
systems and another approach might work better for another system.  This also 
presupposes a corrupt store.

It is unclear to me what the expected behavior would be for this corruption if, 
for example, there were merely a half dozen 0x80 in the middle of ASCII text?  
Is that garbage a single "character"?  Perhaps because it's a consecutive 
string of bad bytes?  Or should it be 6 characters since they're nonsense?  Or 
maybe 2 characters because the maximum # of trail bytes we can have is 3?

What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?

I can see how different implementations might be able to come up with "rules" 
that would help them navigate (or clean up) those minefields, however it is not 
at all clear to me that there is a "best practice" for those situations.

There also appears to be a special weight given to non-minimally-encoded 
sequences.  It would seem to me that none of these illegal sequences should 
appear in practice, so we have either:

* A bad encoder spewing out garbage (overlong sequences)
* Flipped bit(s) due to storage/transmission/whatever errors
* Lost byte(s) due to storage/transmission/coding/whatever errors
* Extra byte(s) due to whatever errors
* Bad string manipulation breaking/concatenating in the middle of sequences, 
causing garbage (perhaps one of the above 2 codeing errors).

Only in the first case, of a bad encoder, are the overlong sequences actually 
"real".  And that shouldn't happen (it's a bad encoder after all).  The other 
scenarios seem just as likely, (or, IMO, much more likely) than a badly 
designed encoder creating overlong sequences that appear to fit the UTF-8 
pattern but aren't actually UTF-8.

The other cases are going to cause byte patterns that are less "obvious" about 
how they should be navigated for various applications.

I do not understand the energy being invested in a case that shouldn't happen, 
especially in a case that is a subset of all the other bad cases that could 
happen.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode

On Wed, 31 May 2017 15:12:12 +0300
Henri Sivonen via Unicode  wrote:

> The write-up mentions
> https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
> like to draw everyone's attention to that bug, which is real-world
> evidence of a bug arising from two UTF-8 decoders within one product
> handling UTF-8 errors differently.

> Does it matter if a proposal/appeal is submitted as a non-member
> implementor person, as an individual person member or as a liaison
> member? http://www.unicode.org/consortium/liaison-members.html list
> "the Mozilla Project" as a liaison member, but Mozilla-side
> conventions make submitting proposals like this "as Mozilla"
> problematic (we tend to avoid "as Mozilla" statements on technical
> standardization fora except when the W3C Process forces us to make
> them as part of charter or Proposed Recommendation review).

There may well be an advantage to being able to answer any questions on
the proposal at the meeting, especially if it isn't read until the
meeting.

> > The modified text is a set of guidelines, not requirements. So no
> > conformance clause is being changed.  
> 
> I'm aware of this.
> 
> > If people really believed that the guidelines in that section
> > should have been conformance clauses, they should have proposed
> > that at some point.  
> 
> It seems to me that this thread does not support the conclusion that
> the Unicode Standard's expression of preference for the number of
> REPLACEMENT CHARACTERs should be made into a conformance requirement
> in the Unicode Standard. This thread could be taken to support a
> conclusion that the Unicode Standard should not express any preference
> beyond "at least one and at most as many as there were bytes".
> 
> On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
>  wrote:
> >  In any case, Henri is complaining that it’s too difficult to
> > implement; it isn’t.  You need two extra states, both of which are
> > trivial.  
> 
> I am not claiming it's too difficult to implement. I think it
> inappropriate to ask implementations, even from-scratch ones, to take
> on added complexity in error handling on mere aesthetic grounds. Also,
> I think it's inappropriate to induce implementations already written
> according to the previous guidance to change (and risk bugs) or to
> make the developers who followed the previous guidance with precision
> be the ones who need to explain why they aren't following the new
> guidance.

How straightforward is the FSM for back-stepping?

> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
>  wrote:
> > The UTF-8 conversion code that I wrote for ICU, and apparently the
> > code that various other people have written, collects sequences
> > starting from lead bytes, according to the original spec, and at
> > the end looks at whether the assembled code point is too low for
> > the lead byte, or is a surrogate, or is above 10. Stopping at a
> > non-trail byte is quite natural, and reading the PRI text
> > accordingly is quite natural too.  
> 
> I don't doubt that other people have written code with the same
> concept as ICU, but as far as non-shortest form handling goes in the
> implementations I tested (see URL at the start of this email) ICU is
> the lone outlier.

You should have researched implementations as they were in 2007.

My own code uses the same concept as Markus's ICU code - convert and
check the resulting value is legal for the length.  As a check,
remember that for n > 1, n bytes could represent 2**(5n + 1) values if
overlongs were permitted.

> > Aside from UTF-8 history, there is a reason for preferring a more
> > "structural" definition for UTF-8 over one purely along valid
> > sequences. This applies to code that *works* on UTF-8 strings
> > rather than just converting them. For UTF-8 *processing* you need
> > to be able to iterate both forward and backward, and sometimes you
> > need not collect code points while skipping over n units in either
> > direction -- but your iteration needs to be consistent in all
> > cases. This is easier to implement (especially in fast, short,
> > inline code) if you have to look only at how many trail bytes
> > follow a lead byte, without having to look whether the first trail
> > byte is in a certain range for some specific lead bytes.  
> 
> But the matter at hand is decoding potentially-invalid UTF-8 input
> into a valid in-memory Unicode representation, so later processing is
> somewhat a red herring as being out of scope for this step.

No.  Both lossily converting a UTF-8-like string as a stream of bytes to
scalar values and moving back and forth through the string 'character'
by 'character' imply an ability to count the number of 'characters' in
the string.  The bug you mentioned arose from two different ways of
counting the string length in 'characters'.  Having two different
'character' counts for the same string is

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode

I've researched this more. While the old advice dominates the handling
of non-shortest forms, there is more variation than I previously
thought when it comes to truncated sequences and CESU-8-style
surrogates. Still, the ICU behavior is an outlier considering the set
of implementations that I tested.

I've written up my findings at https://hsivonen.fi/broken-utf-8/

The write-up mentions
https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
like to draw everyone's attention to that bug, which is real-world
evidence of a bug arising from two UTF-8 decoders within one product
handling UTF-8 errors differently.

On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode
 wrote:
> There is plenty of time for public comment, since it was targeted at Unicode
> 11, the release for about a year from now, not Unicode 10, due this year.
> When the UTC "approves a change", that change is subject to comment, and the
> UTC can always reverse or modify its approval up until the meeting before
> release date. So there are ca. 9 months in which to comment.

What should I read to learn how to formulate an appeal correctly?

Does it matter if a proposal/appeal is submitted as a non-member
implementor person, as an individual person member or as a liaison
member? http://www.unicode.org/consortium/liaison-members.html list
"the Mozilla Project" as a liaison member, but Mozilla-side
conventions make submitting proposals like this "as Mozilla"
problematic (we tend to avoid "as Mozilla" statements on technical
standardization fora except when the W3C Process forces us to make
them as part of charter or Proposed Recommendation review).

> The modified text is a set of guidelines, not requirements. So no
> conformance clause is being changed.

I'm aware of this.

> If people really believed that the guidelines in that section should have
> been conformance clauses, they should have proposed that at some point.

It seems to me that this thread does not support the conclusion that
the Unicode Standard's expression of preference for the number of
REPLACEMENT CHARACTERs should be made into a conformance requirement
in the Unicode Standard. This thread could be taken to support a
conclusion that the Unicode Standard should not express any preference
beyond "at least one and at most as many as there were bytes".

On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
 wrote:
>  In any case, Henri is complaining that it’s too difficult to implement; it 
> isn’t.  You need two extra states, both of which are trivial.

I am not claiming it's too difficult to implement. I think it
inappropriate to ask implementations, even from-scratch ones, to take
on added complexity in error handling on mere aesthetic grounds. Also,
I think it's inappropriate to induce implementations already written
according to the previous guidance to change (and risk bugs) or to
make the developers who followed the previous guidance with precision
be the ones who need to explain why they aren't following the new
guidance.

On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
 wrote:
> The UTF-8 conversion code that I wrote for ICU, and apparently the code that
> various other people have written, collects sequences starting from lead
> bytes, according to the original spec, and at the end looks at whether the
> assembled code point is too low for the lead byte, or is a surrogate, or is
> above 10. Stopping at a non-trail byte is quite natural, and reading the
> PRI text accordingly is quite natural too.

I don't doubt that other people have written code with the same
concept as ICU, but as far as non-shortest form handling goes in the
implementations I tested (see URL at the start of this email) ICU is
the lone outlier.

> Aside from UTF-8 history, there is a reason for preferring a more
> "structural" definition for UTF-8 over one purely along valid sequences.
> This applies to code that *works* on UTF-8 strings rather than just
> converting them. For UTF-8 *processing* you need to be able to iterate both
> forward and backward, and sometimes you need not collect code points while
> skipping over n units in either direction -- but your iteration needs to be
> consistent in all cases. This is easier to implement (especially in fast,
> short, inline code) if you have to look only at how many trail bytes follow
> a lead byte, without having to look whether the first trail byte is in a
> certain range for some specific lead bytes.

But the matter at hand is decoding potentially-invalid UTF-8 input
into a valid in-memory Unicode representation, so later processing is
somewhat a red herring as being out of scope for this step. I do agree
that if you already know that the data is valid UTF-8, it makes sense
to work from the bit pattern definition only. (E.g. in encoding_rs,
the implementation I've written and that's on track to replacing uconv
in Firefox, UTF-8 decode works

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode

On Fri, 26 May 2017 21:41:49 +
Shawn Steele via Unicode  wrote:

> I totally get the forward/backward scanning in sync without decoding
> reasoning for some implementations, however I do not think that the
> practices that benefit those should extend to other applications that
> are happy with a different practice.

> In either case, the bad characters are garbage, so neither approach
> is "better" - except that one or the other may be more conducive to
> the requirements of the particular API/application.

There's a potential issue with input methods that indirectly edit the
backing store.  For example, GTK input methods (e.g. function
gtk_im_context_delete_surrounding()) can delete an amount of text
specified in characters, not storage units.  (Deletion by storage
units is not available in this interface.)  This might cause utter
confusion or worse if the backing store starts out corrupt.  A corrupt
backing store is normally manually correctable if most of the text is
ASCII.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode

On Tue, 30 May 2017 16:38:45 -0600
Karl Williamson via Unicode  wrote:

> Under Best Practices, how many REPLACEMENT CHARACTERs should the 
> sequence  generate?  0, 1, 2, 3, 4 ?
> 
> In practice, how many do parsers generate?

See Markus Kuhn's test page
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, test
5.1.5.  Firefox generates three replacement characters.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode

On Fri, 26 May 2017 11:22:37 -0700
Ken Whistler via Unicode  wrote:

> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
> > The link provided about the PRI doesn't lead to the comments.
> >  
> 
> PRI #121 (August, 2008) pre-dated the practice of keeping all the 
> feedback comments together with the PRI itself in a numbered
> directory with the name "feedback.html". But the comments were
> collected together at the time and are accessible here:
> 
> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
> 
> Also there was a separately submitted comment document:
> 
> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
> 
> And the minutes of the pertinent UTC meeting (UTC #116):
> 
> http://www.unicode.org/L2/L2008/08253.htm
> 
> The minutes simply capture the consensus to adopt Option #2 from PRI 
> #121, and the relevant action items.

For Unicode members, there is also the original Unicore thread, which
starts at
http://www.unicode.org/mail-arch/unicore-ml/y2008-m04/0091.html .

(I couldn't find anything on the general list.)

There were objections there to replacing non-shortest form sequences by
multiple ocurrences of U+FFFD.  They were rejected by those that
mattered, and so the option of a single U+FFFD was not included in the
PRI.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode

That's not at all the same as saying it was a valid sequence. That's saying 
decoders were allowed to be lenient with invalid sequences.
We're supposed to be comfortable with standards language here. Do we really not 
understand this distinction?


--Doug Ewell | Thornton, CO, US | ewellic.org
 Original message From: Karl Williamson 
<pub...@khwilliamson.com> Date: 5/30/17  16:32  (GMT-07:00) To: Doug Ewell 
<d...@ewellic.org>, Unicode Mailing List <unicode@unicode.org> Subject: Re: 
Feedback on the proposal to change U+FFFD generation when
  decoding ill-formed UTF-8 
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote:
> L2/17-168 says:
> 
> "For UTF-8, recommend evaluating maximal subsequences based on the
> original structural definition of UTF-8, without ever restricting trail
> bytes to less than 80..BF. For example:  is a single maximal
> subsequence because C0 was originally a lead byte for two-byte
> sequences."
> 
> When was it ever true that C0 was a valid lead byte? And what does that
> have to do with (not) restricting trail bytes?

Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
   as U+002F.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode

> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence  
> as U+002F.

Sort of, maybe.  It was not legal for them to generate it though.  So you could 
kind of infer that it was not a legal sequence.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode

Under Best Practices, how many REPLACEMENT CHARACTERs should the 
sequence  generate?  0, 1, 2, 3, 4 ?


In practice, how many do parsers generate?

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode


On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote:

L2/17-168 says:

"For UTF-8, recommend evaluating maximal subsequences based on the
original structural definition of UTF-8, without ever restricting trail
bytes to less than 80..BF. For example:  is a single maximal
subsequence because C0 was originally a lead byte for two-byte
sequences."

When was it ever true that C0 was a valid lead byte? And what does that
have to do with (not) restricting trail bytes?


Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
  as U+002F.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode

L2/17-168 says:

"For UTF-8, recommend evaluating maximal subsequences based on the
original structural definition of UTF-8, without ever restricting trail
bytes to less than 80..BF. For example:  is a single maximal
subsequence because C0 was originally a lead byte for two-byte
sequences."

When was it ever true that C0 was a valid lead byte? And what does that
have to do with (not) restricting trail bytes?
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode

> Which is to completely reverse the current recommendation in Unicode 9.0. 
> While I agree that this might help you fending off a bug report, it would 
> create chances for bug reports for Ruby, Python3, many if not all Web 
> browsers,...

& Windows & .Net

Changing the behavior of the Windows / .Net SDK is a non-starter.

> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows 
> what it means, but everybody knows they don't exist.

Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN.  
We're trying to optimize a case for data that implementations should never 
encounter.  It is sort of exactly like optimizing for the case where your data 
input is actually a dragon and not UTF-8 text.  

Since it is illegal, then the "at least 1 FFFD but as many as you want to emit 
(or just fail)" is fine.

-Shawn

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode

> I think nobody is debating that this is *one way* to do things, and that some 
> code does it.

Except that they sort of are.  The premise is that the "old language was 
wrong", and the "new language is right."  The reason we know the old language 
was wrong was that there was a bug filed against an implementation because it 
did not conform to the old language.  The response to the application bug was 
to change the standard's recommendation.

If this language is adopted, then the opposite is going to happen:  Bugs will 
be filed against applications that conform to the old recommendation and not 
the new recommendation.  They will say "your code could be better, it is not 
following the recommendation."  Eventually that will escalate to some level 
that it will need to be considered, however, regardless of the improvements, it 
will be a "breaking change".

Changing code from one recommendation to another will change behavior.  For 
applications or SDKs with enough visibility, that will break *someone* because 
that's how these things work.  For applications that choose not to change, in 
response to some RFP, someone's going to say "you don't fully conform to 
Unicode, we'll go with a different vendor."  Not saying that these things make 
sense, that's just the way the world works.

In some situations, one form is better, in some cases another form is better.  
If the intent is truly that there is not "one way to do things," then the 
language should reflect that.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode


Hello Karl, others,

On 2017/05/27 06:15, Karl Williamson via Unicode wrote:

On 05/26/2017 12:22 PM, Ken Whistler wrote:


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected 
together at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken




The reason this discussion got started was that in December, someone 
came to me and said the code I support does not follow Unicode best 
practices, and suggested I need to change, though no ticket (yet) has 
been filed.  I was surprised, and posted a query to this list about what 
the advantages of the new approach are.


Can you provide a reference to that discussion? I might have missed it 
in December.


There were a number of replies, 
but I did not see anything that seemed definitive.  After a month, I 
created a ticket in Unicode and Markus was assigned to research it, and 
came up with the proposal currently being debated.


Which is to completely reverse the current recommendation in Unicode 
9.0. While I agree that this might help you fending off a bug report, it 
would create chances for bug reports for Ruby, Python3, many if not all 
Web browsers,...



Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print.


In standards, the "fine print" matters.

That seems to be borne out by Markus, even with his stake in ICU, 
supporting option #2.


Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I 
also supported option 2, with code behind it.


Looking at the comments, I don't see any discussion of the effect of 
this on overlong treatments.  My guess is that the effect change was 
unintentional.


I agree that it was probably not considered explicitly. But overlongs 
were disallowed for security reasons, and once the definition of UTF-8 
was tightened, "overlongs" essentially did not exist anymore. 
Essentially, "overlong" is a word like "dragon" or "ghost": Everybody 
knows what it means, but everybody knows they don't exist.


[Just to be sure, by the above, I don't mean that a sequence such as
C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by 
itself, and there is no need to see C0 B0 as a (ghost) sequence.]



So I have code that handled overlongs in the only correct way possible 
when they were acceptable,


No. As long as they were acceptable, they wouldn't have been replaced by 
an FFFD.



and in the obvious way after they became illegal,


Why? A change was necessary from producing an actual character to 
producing some number of FFFDs. It may have been easier to produce just 
a single FFFD, but that depends on how the code was organized.


and now without apparent discussion (which is very much akin to 
"flimsy reasons"), it suddenly was no longer "best practice".


Not 'now', but almost 9 years ago. And not "without apparent 
discussion", but with an explicit PRI.


And that 
change came "rather late in the game".  That this escaped notice for 
years indicates that the specifics of REPLACEMENT CHAR handling don't 
matter all that much.


I agree. You haven't even yet received a ticket yet.


To cut to the chase, I think Unicode should issue a Corrigendum to the 
effect that it was never the intent of this change to say that treating 
overlongs as a single unit isn't best practice.  I'm not sure this 
warrants a full-fledge Corrigendum, though.  But I believe the text of 
the best practices should indicate that treating overlongs as a single 
unit is just as acceptable as Martin's interpretation.


I'd essentially be fine with that, under the condition that the current 
recommendation is maintained as a clearly identified recommendation, so 
that Python3, Ruby, Web standards and browsers, and so on can easily 
refer to it.


Regards,   Martin.

I believe this is pretty much in line with Shawn's position.  Certainly, 
a discussion of the reasons one might choose one interpretation over 
another should be included in TUS.  That would likely have satisfied my 
original query, which hence would never have been posted.

.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode


Hello Markus, others,

On 2017/05/27 00:41, Markus Scherer wrote:

On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst 
wrote:


But there's plenty in the text that makes it absolutely clear that some
things cannot be included. In particular, it says




The term “maximal subpart of an ill-formed subsequence” refers to the code
units that were collected in this manner. They could be the start of a
well-formed sequence, except that the sequence lacks the proper
continuation. Alternatively, the converter may have found an continuation
code unit, which cannot be the start of a well-formed sequence.




And the "in this manner" refers to:



A sequence of code units will be processed up to the point where the
sequence either can be unambiguously interpreted as a particular Unicode
code point or where the converter recognizes that the code units collected
so far constitute an ill-formed subsequence.




So we have the same thing twice: Bail out as soon as something is
ill-formed.



The UTF-8 conversion code that I wrote for ICU, and apparently the code
that various other people have written, collects sequences starting from
lead bytes, according to the original spec, and at the end looks at whether
the assembled code point is too low for the lead byte, or is a surrogate,
or is above 10. Stopping at a non-trail byte is quite natural,


I think nobody is debating that this is *one way* to do things, and that 
some code does it.



and
reading the PRI text accordingly is quite natural too.


So you are claiming that you're covered because you produce an FFFD 
"where the converter recognizes that the code units collected so far 
constitute an ill-formed subsequence", except that your converter is a 
bit slow in doing that recognition?


Well, I guess I could come up with another converter that would be even 
slower at recognizing that the code units collected so far constitute an 
ill-formed subsequence. Would that still be okay in your view?


And please note that your "just a bit slow" interpretation might somehow 
work for Unicode 5.2, but it doesn't work for Unicode 9.0, because over 
the years, things have been tightened up, and the standard now makes it 
perfectly clear that C0 by itself is a maximal subpart of an ill-formed 
subsequence. From Section 3.9 of 
http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf:



Applying the definition of maximal subparts
for these ill-formed subsequences, in the first case  is a maximal
subpart, because that byte value can never be the first byte of a 
well-formed UTF-8 sequence.





Aside from UTF-8 history, there is a reason for preferring a more
"structural" definition for UTF-8 over one purely along valid sequences.


There may be all kinds of reasons for doing things one way or another. 
But there are good reasons why the current recommendation is in place, 
and there are even better reasons for not suddenly reversing it to 
something completely different.




This applies to code that *works* on UTF-8 strings rather than just
converting them. For UTF-8 *processing* you need to be able to iterate both
forward and backward, and sometimes you need not collect code points while
skipping over n units in either direction -- but your iteration needs to be
consistent in all cases. This is easier to implement (especially in fast,
short, inline code) if you have to look only at how many trail bytes follow
a lead byte, without having to look whether the first trail byte is in a
certain range for some specific lead bytes.

(And don't say that everyone can validate all strings once and then all
code can assume they are valid: That just does not work for library code,
you cannot assume anything about your input strings, and you cannot crash
when they are ill-formed.)


[rest of mail mostly OT]

Well, different libraries may make different choices. As an example, the 
Ruby programming language does essentially that: Whenever it finds an 
invalid string, it raises an exception.


Not all processing on all kinds of invalid strings immediately raises an 
exception (because of efficiency considerations). But there are quite 
strong expectations that this happens soon. As an example, when I 
extended case conversion from ASCII only to Unicode (see e.g. 
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/, 
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/), I had to go back 
and fix some things because there were explicit tests checking that 
invalid inputs would raise exceptions.


At least for Ruby, this policy of catching problems early rather than 
allowing garbage-in-garbage-out has worked well.




markus


Regards,   Martin.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Shawn Steele via Unicode

So basically this came about because code got bugged for not following the 
"recommendation."   To fix that, the recommendation will be changed.  However 
then that is going to lead to bugs for other existing code that does not follow 
the new recommendation.

I totally get the forward/backward scanning in sync without decoding reasoning 
for some implementations, however I do not think that the practices that 
benefit those should extend to other applications that are happy with a 
different practice.

In either case, the bad characters are garbage, so neither approach is "better" 
- except that one or the other may be more conducive to the requirements of the 
particular API/application.

I really think the correct approach here is to allow any number of replacement 
characters without prejudice.  Perhaps with suggestions for pros and cons of 
various approaches if people feel that is really necessary.

-Shawn

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson 
via Unicode
Sent: Friday, May 26, 2017 2:16 PM
To: Ken Whistler <kenwhist...@att.net>
Cc: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 05/26/2017 12:22 PM, Ken Whistler wrote:
> 
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>> The link provided about the PRI doesn't lead to the comments.
>>
> 
> PRI #121 (August, 2008) pre-dated the practice of keeping all the 
> feedback comments together with the PRI itself in a numbered directory 
> with the name "feedback.html". But the comments were collected 
> together at the time and are accessible here:
> 
> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
> 
> Also there was a separately submitted comment document:
> 
> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
> 
> And the minutes of the pertinent UTC meeting (UTC #116):
> 
> http://www.unicode.org/L2/L2008/08253.htm
> 
> The minutes simply capture the consensus to adopt Option #2 from PRI 
> #121, and the relevant action items.
> 
> I now return the floor to the distinguished disputants to continue 
> litigating history. ;-)
> 
> --Ken
> 
>

The reason this discussion got started was that in December, someone came to me 
and said the code I support does not follow Unicode best practices, and 
suggested I need to change, though no ticket (yet) has been filed.  I was 
surprised, and posted a query to this list about what the advantages of the new 
approach are.  There were a number of replies, but I did not see anything that 
seemed definitive.  After a month, I created a ticket in Unicode and Markus was 
assigned to research it, and came up with the proposal currently being debated.

Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by Markus, even with his stake in ICU, supporting 
option #2.

Looking at the comments, I don't see any discussion of the effect of this on 
overlong treatments.  My guess is that the effect change was unintentional.

So I have code that handled overlongs in the only correct way possible when 
they were acceptable, and in the obvious way after they became illegal, and now 
without apparent discussion (which is very much akin to "flimsy reasons"), it 
suddenly was no longer "best practice".  And that change came "rather late in 
the game".  That this escaped notice for years indicates that the specifics of 
REPLACEMENT CHAR handling don't matter all that much.

To cut to the chase, I think Unicode should issue a Corrigendum to the effect 
that it was never the intent of this change to say that treating overlongs as a 
single unit isn't best practice.  I'm not sure this warrants a full-fledge 
Corrigendum, though.  But I believe the text of the best practices should 
indicate that treating overlongs as a single unit is just as acceptable as 
Martin's interpretation.

I believe this is pretty much in line with Shawn's position.  Certainly, a 
discussion of the reasons one might choose one interpretation over another 
should be included in TUS.  That would likely have satisfied my original query, 
which hence would never have been posted.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode


On 05/26/2017 12:22 PM, Ken Whistler wrote:


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected together 
at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken




The reason this discussion got started was that in December, someone 
came to me and said the code I support does not follow Unicode best 
practices, and suggested I need to change, though no ticket (yet) has 
been filed.  I was surprised, and posted a query to this list about what 
the advantages of the new approach are.  There were a number of replies, 
but I did not see anything that seemed definitive.  After a month, I 
created a ticket in Unicode and Markus was assigned to research it, and 
came up with the proposal currently being debated.


Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by Markus, even with his stake in ICU, 
supporting option #2.


Looking at the comments, I don't see any discussion of the effect of 
this on overlong treatments.  My guess is that the effect change was 
unintentional.


So I have code that handled overlongs in the only correct way possible 
when they were acceptable, and in the obvious way after they became 
illegal, and now without apparent discussion (which is very much akin to 
"flimsy reasons"), it suddenly was no longer "best practice".  And that 
change came "rather late in the game".  That this escaped notice for 
years indicates that the specifics of REPLACEMENT CHAR handling don't 
matter all that much.


To cut to the chase, I think Unicode should issue a Corrigendum to the 
effect that it was never the intent of this change to say that treating 
overlongs as a single unit isn't best practice.  I'm not sure this 
warrants a full-fledge Corrigendum, though.  But I believe the text of 
the best practices should indicate that treating overlongs as a single 
unit is just as acceptable as Martin's interpretation.


I believe this is pretty much in line with Shawn's position.  Certainly, 
a discussion of the reasons one might choose one interpretation over 
another should be included in TUS.  That would likely have satisfied my 
original query, which hence would never have been posted.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Ken Whistler via Unicode



On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected together 
at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode


On 05/26/2017 04:28 AM, Martin J. Dürst wrote:
It may be worth to think about whether the Unicode standard should 
mention implementations like yours. But there should be no doubt about 
the fact that the PRI and Unicode 5.2 (and the current version of 
Unicode) are clear about what they recommend, and that that 
recommendation is based on the definition of UTF-8 at that time (and 
still in force), and not at based on a historical definition of UTF-8.


The link provided about the PRI doesn't lead to the comments.

Is there any evidence that there was a realization that the language 
being adopted would lead to overlongs being split into multiple subparts?

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode

On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst 
wrote:

> But there's plenty in the text that makes it absolutely clear that some
> things cannot be included. In particular, it says
>
> 
> The term “maximal subpart of an ill-formed subsequence” refers to the code
> units that were collected in this manner. They could be the start of a
> well-formed sequence, except that the sequence lacks the proper
> continuation. Alternatively, the converter may have found an continuation
> code unit, which cannot be the start of a well-formed sequence.
> 
>
> And the "in this manner" refers to:
> 
> A sequence of code units will be processed up to the point where the
> sequence either can be unambiguously interpreted as a particular Unicode
> code point or where the converter recognizes that the code units collected
> so far constitute an ill-formed subsequence.
> 
>
> So we have the same thing twice: Bail out as soon as something is
> ill-formed.

The UTF-8 conversion code that I wrote for ICU, and apparently the code
that various other people have written, collects sequences starting from
lead bytes, according to the original spec, and at the end looks at whether
the assembled code point is too low for the lead byte, or is a surrogate,
or is above 10. Stopping at a non-trail byte is quite natural, and
reading the PRI text accordingly is quite natural too.

Aside from UTF-8 history, there is a reason for preferring a more
"structural" definition for UTF-8 over one purely along valid sequences.
This applies to code that *works* on UTF-8 strings rather than just
converting them. For UTF-8 *processing* you need to be able to iterate both
forward and backward, and sometimes you need not collect code points while
skipping over n units in either direction -- but your iteration needs to be
consistent in all cases. This is easier to implement (especially in fast,
short, inline code) if you have to look only at how many trail bytes follow
a lead byte, without having to look whether the first trail byte is in a
certain range for some specific lead bytes.

(And don't say that everyone can validate all strings once and then all
code can assume they are valid: That just does not work for library code,
you cannot assume anything about your input strings, and you cannot crash
when they are ill-formed.)

markus

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Philippe Verdy via Unicode

>
> Citing directly from the PRI:
>
> 
> The term "maximal subpart of the ill-formed subsequence" refers to the
> longest potentially valid initial subsequence or, if none, then to the next
> single code unit.
> 
>

The way i understand it is that C0 80 will have TWO maximal subparts,
because there's not any valid initial subsequence, so only the next single
code unit (C0) will be considered. After this the following byte 80 also
has not any valid initial subsequence, so here again only the next single
code unit (80) will be considered. You'll get U+FFFD replacements emitted
twice. This treats all cases of "overlong" sequences that were in the old
UTF-8 definition in the first RFC.

For C3 80 20, there will be only ONE maximal subpart because C3 80 is a
valid initial subsequence, so a single U+FFFD replacement will be emitted,
followed then by the valid UTF-8 sequence (20) which will correctly decode
as U+0020.

Good ! This means that this proposal makes sense and is compatible with
random accesses within the encoded text whithout having to look backward
for an indefinite number of code units and we never have to handle any case
with possibly infinite number of code units mapped to the same U+FFFD
replacement.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Martin J. Dürst via Unicode

On 2017/05/25 09:22, Markus Scherer wrote:

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson
wrote:

On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and in use
widely (among else, in major programming language and browsers) without
problems for quite some time.

Could you supply a reference to the PRI and its feedback?

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

It is correct that it didn't give any of the *examples* that are under
discussion now. On the other hand, the PRI is very clear about what it
means by "maximal subpart":

Citing directly from the PRI:

The term "maximal subpart of the ill-formed subsequence" refers to the
longest potentially valid initial subsequence or, if none, then to the
next single code unit.

At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier (RFC 3629
(https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect
the tightening of the UTF-8 definition in Unicode/ISO 10646).

The recommendation in TUS 5.2 is "Replace each maximal subpart of an

ill-formed subsequence by a single U+FFFD."

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

No, the text, both in the PRI and in Unicode 5.2, is quite clear. The
"does not fit" (which I haven't found in either text) is clearly
grounded by "ill-formed UTF-8". And there's no question about what
"ill-formed UTF-8" means, in particular in Unicode 5.2, where you just
have to go two pages back to find byte sequences such as , 80>, and all called out explicitly as ill-formed.

Any kind of claim, as in the L2/17-168 document, about there being an
option 2a, are just not substantiated. It's true that there are no
explicit examples in the PRI that would allow to distinguish between
converting e.g.

FC BF BF BF BF 80
to a single FFFD or to six of these. But there's no need to have
examples for every corner case if the text is clear enough. In the above
six-byte sequence, there's not a single potentially valid (initial)
subsequence, so it's all single code units.

And I agree with that. And I view an overlong sequence as a maximal
ill-formed subsequence

Can you point to any definition that would include or allow such an
interpretation? I just haven't found any yet, neither in the PRI nor in
Unicode 5.2.

that should be replaced by a single FFFD. There's
nothing in the text of 5.2 that immediately follows that recommendation
that indicates to me that my view is incorrect.

I have to agree that the text in Unicode 5.2 could be clearer. It's a
hodgepodge of attempts at justifications and definitions. And the word
"maximal" itself may also contribute to pushing the interpretation in
one direction.

But there's plenty in the text that makes it absolutely clear that some
things cannot be included. In particular, it says

The term “maximal subpart of an ill-formed subsequence” refers to the
code units that were collected in this manner. They could be the start
of a well-formed sequence, except that the sequence lacks the proper
continuation. Alternatively, the converter may have found an
continuation code unit, which cannot be the start of a well-formed sequence.

And the "in this manner" refers to:

A sequence of code units will be processed up to the point where the
sequence either can be unambiguously interpreted as a particular Unicode
code point or where the converter recognizes that the code units
collected so far constitute an ill-formed subsequence.

So we have the same thing twice: Bail out as soon as something is
ill-formed.

Perhaps my view is colored by the fact that I now maintain code that was
written to parse UTF-8 back when overlongs were still considered legal
input.

Thanks for providing this information. That's a lot more useful than
"feels right", which was given as a reason on this list before.

An overlong was a single unit. When they became illegal, the code
still considered them a single unit.

That's fine for your code. I might do the same (or not) if I were you,
because one indeed never knows in which situation some code is used, and
what repercussions a change might produce.

But the PRI, and the wording in Unicode 5.2, was created when overlongs
and 5-byte and 6-byte sequences and surrogate pairs,... were very
clearly

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson 
wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among else, in major programming language and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>

Right.

I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>

Right.

But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>

I agree.

markus

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode


On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

On 2017/05/24 05:57, Karl Williamson via Unicode wrote:

On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:



Adding a "recommendation" this late in the game is just bad standards
policy.



Unless I misunderstand, you are missing the point.  There is already a
recommendation listed in TUS,


That's indeed correct.



and that recommendation appears to have
been added without much thought.


That's wrong. There was a public review issue with various options and 
with feedback, and the recommendation has been implemented and in use 
widely (among else, in major programming language and browsers) without 
problems for quite some time.


Could you supply a reference to the PRI and its feedback?

The recommendation in TUS 5.2 is "Replace each maximal subpart of an 
ill-formed subsequence by a single U+FFFD."


And I agree with that.  And I view an overlong sequence as a maximal 
ill-formed subsequence that should be replaced by a single FFFD. 
There's nothing in the text of 5.2 that immediately follows that 
recommendation that indicates to me that my view is incorrect.


Perhaps my view is colored by the fact that I now maintain code that was 
written to parse UTF-8 back when overlongs were still considered legal 
input.  An overlong was a single unit.  When they became illegal, the 
code still considered them a single unit.


I can understand how someone who comes along later could say C0 can't be 
followed by any continuation character that doesn't yield an overlong, 
therefore C0 is a maximal subsequence.


But I assert that my interpretation is just as valid as that one.  And 
perhaps more so, because of historical precedent.


It appears to me that little thought was given to the fact that these 
changes would cause overlongs to now be at least two units instead of 
one, making long existing code no longer be best practice.  You are 
effectively saying I'm wrong about this.  I thought I had been paying 
attention to PRI's since the 5.x series, and I don't remember anything 
about this.  If you have evidence to the contrary, please give it. 
However, I would have thought Markus would have dug any up and given it 
in his proposal.






There is no proposal to add a
recommendation "this late in the game".


True. The proposal isn't for an addition, it's for a change. The "late 
in the game" however, still applies.


Regards,   Martin.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Martin J. Dürst via Unicode


On 2017/05/24 05:57, Karl Williamson via Unicode wrote:

On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:



Adding a "recommendation" this late in the game is just bad standards
policy.



Unless I misunderstand, you are missing the point.  There is already a
recommendation listed in TUS,


That's indeed correct.



and that recommendation appears to have
been added without much thought.


That's wrong. There was a public review issue with various options and 
with feedback, and the recommendation has been implemented and in use 
widely (among else, in major programming language and browsers) without 
problems for quite some time.




There is no proposal to add a
recommendation "this late in the game".


True. The proposal isn't for an addition, it's for a change. The "late 
in the game" however, still applies.


Regards,   Martin.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Karl Williamson via Unicode


On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:

On 5/23/2017 10:45 AM, Markus Scherer wrote:
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode 
> wrote:


So, if the proposal for Unicode really was more of a "feels right"
and not a "deviate at your peril" situation (or necessary escape
hatch), then we are better off not making a RECOMMEDATION that
goes against collective practice.


I think the standard is quite clear about this:

Although a UTF-8 conversion process is required to never consume
well-formed subsequences as part of its error handling for
ill-formed subsequences, such a process is not otherwise
constrained in how it deals with any ill-formed subsequence
itself. An ill-formed subsequence consisting of more than one code
unit could be treated as a single error or as multiple errors.


And why add a recommendation that changes that from completely up to the 
implementation (or groups of implementations) to something where one way 
of doing it now has to justify itself?


If the thread has made one thing clear is that there's no consensus in 
the wider community that one approach is obviously better. When it comes 
to ill-formed sequences, all bets are off. Simple as that.


Adding a "recommendation" this late in the game is just bad standards 
policy.


A./




Unless I misunderstand, you are missing the point.  There is already a 
recommendation listed in TUS, and that recommendation appears to have 
been added without much thought.  There is no proposal to add a 
recommendation "this late in the game".

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Doug Ewell via Unicode

Asmus Freytag \(c\) wrote:

> And why add a recommendation that changes that from completely up to
> the implementation (or groups of implementations) to something where
> one way of doing it now has to justify itself?

A recommendation already exists, at the end of Section 3.9. The current
proposal is to change it to recommend something else. 

--
Doug Ewell | Thornton, CO, US | ewellic.org

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode

> If the thread has made one thing clear is that there's no consensus in the 
> wider community
> that one approach is obviously better. When it comes to ill-formed sequences, 
> all bets are off.
> Simple as that.

> Adding a "recommendation" this late in the game is just bad standards policy.

I agree.  I'm not sure what value this provides.  If someone thought it added 
value to discuss the pros and cons of implementing it one way and the other as 
MAY do this or MAY do that, I don't mind.  But I think both should be 
permitted, and neither should be encouraged with anything stronger than a MAY.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag (c) via Unicode


On 5/23/2017 10:45 AM, Markus Scherer wrote:
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode 
> wrote:


So, if the proposal for Unicode really was more of a "feels right"
and not a "deviate at your peril" situation (or necessary escape
hatch), then we are better off not making a RECOMMEDATION that
goes against collective practice.


I think the standard is quite clear about this:

Although a UTF-8 conversion process is required to never consume
well-formed subsequences as part of its error handling for
ill-formed subsequences, such a process is not otherwise
constrained in how it deals with any ill-formed subsequence
itself. An ill-formed subsequence consisting of more than one code
unit could be treated as a single error or as multiple errors.


And why add a recommendation that changes that from completely up to the 
implementation (or groups of implementations) to something where one way 
of doing it now has to justify itself?


If the thread has made one thing clear is that there's no consensus in 
the wider community that one approach is obviously better. When it comes 
to ill-formed sequences, all bets are off. Simple as that.


Adding a "recommendation" this late in the game is just bad standards 
policy.


A./

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Alastair Houghton via Unicode


> On 23 May 2017, at 18:45, Markus Scherer via Unicode  
> wrote:
> 
> On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode 
>  wrote:
>> So, if the proposal for Unicode really was more of a "feels right" and not a 
>> "deviate at your peril" situation (or necessary escape hatch), then we are 
>> better off not making a RECOMMEDATION that goes against collective practice.
> 
> I think the standard is quite clear about this:
> 
> Although a UTF-8 conversion process is required to never consume well-formed 
> subsequences as part of its error handling for ill-formed subsequences, such 
> a process is not otherwise constrained in how it deals with any ill-formed 
> subsequence itself. An ill-formed subsequence consisting of more than one 
> code unit could be treated as a single error or as multiple errors.

Agreed.  That paragraph is entirely clear.

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Markus Scherer via Unicode

On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> So, if the proposal for Unicode really was more of a "feels right" and not
> a "deviate at your peril" situation (or necessary escape hatch), then we
> are better off not making a RECOMMEDATION that goes against collective
> practice.
>

I think the standard is quite clear about this:

Although a UTF-8 conversion process is required to never consume
well-formed subsequences as part of its error handling for ill-formed
subsequences, such a process is not otherwise constrained in how it deals
with any ill-formed subsequence itself. An ill-formed subsequence
consisting of more than one code unit could be treated as a single error or
as multiple errors.

markus

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode

+ the list, which somehow my reply seems to have lost.

> I may have missed something, but I think nobody actually proposed to change 
> the recommendations into requirements

No thanks, that would be a breaking change for some implementations (like mine) 
and force them to become non-complying or potentially break customer behavior.

I would prefer that both options be permitted, perhaps with a few words of 
advantages.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag via Unicode


  
  
On 5/23/2017 1:24 AM, Martin J. Dürst
  via Unicode wrote:

Hello
  Mark,
  
  
  On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote:
  
  I actually didn't see any of this
discussion until today.

  
  
  Many thanks for chiming in.
  
  
  (

unicode@unicode.org mail was going into my spam folder...) I
started

reading the thread, but it looks like a lot of it is OT,

  
  
  As is quite usual on mailing list :-(.
  
  
  so just scanned

some of them.


A few brief points:


   1. There is plenty of time for public comment, since it was

targeted at *Unicode

   11*, the release for about a year from now, *not* *Unicode
10*, due this

   year.

   2. When the UTC "approves a change", that change is subject
to comment,

   and the UTC can always reverse or modify its approval up
until the meeting

   before release date. *So there are ca. 9 months in which to
comment.*

  
  
  This is good to hear. What's the best way to submit such comments?
  
  
     3. The modified text is a set of
guidelines, not requirements. So no

   conformance clause is being changed.

   - If people really believed that the guidelines in that
section should

  have been conformance clauses, they should have proposed
that at

some point.

  
  
  I may have missed something, but I think nobody actually proposed
  to change the recommendations into requirements. I think everybody
  understands that there are several ways to do things, and
  situations where one or the other is preferred. The only advantage
  of changing the current recommendations to requirements would be
  to make it more difficult for them to be changed.
  


In this context it's worth looking at other standards organization's
use of "recommended", because that may explain a lot of people's
unease with this. For example, IETF has RFC 2119 which says:
1. MUST  This word, or the terms "REQUIRED" or "SHALL", mean that the
   definition is an absolute requirement of the specification.

...
3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

..

5. MAY   This word, or the adjective "OPTIONAL", mean that an item is
   truly optional.  One vendor may choose to include the item because a
   particular marketplace requires it or because the vendor feels that
   it enhances the product while another vendor may omit the same item.
   An implementation which does not include a particular option MUST be
   prepared to interoperate with another implementation which does
   include the option, though perhaps with reduced functionality. In the
   same vein an implementation which does include a particular option
   MUST be prepared to interoperate with another implementation which
   does not include the option (except, of course, for the feature the
   option provides.)

Reading this, it's clear that "RECOMMENDED" is not merely  a "we
think this is the best way to do it" but a rather sterner "you
deviate at your peril" kind of statement.

The latter is what makes it difficult for others to collectively
agree on a different choice faced with a formal RECOMMENDATION.

So, if the proposal for Unicode really was more of a "feels right"
and not a "deviate at your peril" situation (or necessary escape
hatch), then we are better off not making a RECOMMEDATION that goes
against collective practice.
A./



  
  I think the situation at hand is somewhat special: Recommendations
  are okay. But there's a strong wish from downstream communities
  such asWeb browser implementers and programming language/library
  implementers to not change these recommendations. Some of these
  communities have stricter requirement for alignment, and some have
  followed longstanding recommendations in the absence of specific
  arguments for something different.
  
  
  Regards,   Martin.
  
  
    - And still can proposal that — as I
said, there is plenty of time.



Mark


On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode <

unicode@unicode.org> wrote:


Henri Sivonen wrote:
  
  
  I find it

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Alastair Houghton via Unicode

On 23 May 2017, at 07:10, Jonathan Coxhead via Unicode  
wrote:
> 
> On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote:
>> On 18 May 2017, at 07:18, Henri Sivonen via Unicode 
>>  wrote:
>> 
>>> the decision complicates U+FFFD generation when validating UTF-8 by state 
>>> machine.
>>> 
>> It *really* doesn’t.  Even if you’re hell bent on using a pure state machine 
>> approach, you need to add maybe two additional error states 
>> (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) 
>> on top of the states you already have.  The implementation complexity 
>> argument is a *total* red herring.
> 
> Heh. A state machine with N+2 states is, a fortiori, more complex than one 
> with N states. So I think your argument is self-contradictory.

You’re being overly pedantic (and in this case, actually, the cyclomatic 
complexity of the state machine wouldn’t increase).  In any case, Henri is 
complaining that it’s too difficult to implement; it isn’t.  You need two extra 
states, both of which are trivial.

The point I was making was that this is not a strong argument against the 
proposed change, *even if* we were treating it as a requirement, which it isn’t.

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Martin J. Dürst via Unicode


Hello Mark,

On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote:

I actually didn't see any of this discussion until today.


Many thanks for chiming in.


(
unicode@unicode.org mail was going into my spam folder...) I started
reading the thread, but it looks like a lot of it is OT,


As is quite usual on mailing list :-(.


so just scanned
some of them.

A few brief points:

   1. There is plenty of time for public comment, since it was
targeted at *Unicode
   11*, the release for about a year from now, *not* *Unicode 10*, due this
   year.
   2. When the UTC "approves a change", that change is subject to comment,
   and the UTC can always reverse or modify its approval up until the meeting
   before release date. *So there are ca. 9 months in which to comment.*


This is good to hear. What's the best way to submit such comments?


   3. The modified text is a set of guidelines, not requirements. So no
   conformance clause is being changed.
   - If people really believed that the guidelines in that section should
  have been conformance clauses, they should have proposed that at
some point.


I may have missed something, but I think nobody actually proposed to 
change the recommendations into requirements. I think everybody 
understands that there are several ways to do things, and situations 
where one or the other is preferred. The only advantage of changing the 
current recommendations to requirements would be to make it more 
difficult for them to be changed.


I think the situation at hand is somewhat special: Recommendations are 
okay. But there's a strong wish from downstream communities such asWeb 
browser implementers and programming language/library implementers to 
not change these recommendations. Some of these communities have 
stricter requirement for alignment, and some have followed longstanding 
recommendations in the absence of specific arguments for something 
different.


Regards,   Martin.


  - And still can proposal that — as I said, there is plenty of time.


Mark

On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode <
unicode@unicode.org> wrote:


Henri Sivonen wrote:


I find it shocking that the Unicode Consortium would change a
widely-implemented part of the standard (regardless of whether Unicode
itself officially designates it as a requirement or suggestion) on
such flimsy grounds.

I'd like to register my feedback that I believe changing the best
practices is wrong.


Perhaps surprisingly, it's already too late. UTC approved this change
the day after the proposal was written.

http://www.unicode.org/L2/L2017/17103.htm#151-C19

--
Doug Ewell | Thornton, CO, US | ewellic.org







--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Jonathan Coxhead via Unicode


On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote:

On 18 May 2017, at 07:18, Henri Sivonen via Unicode  wrote:

the decision complicates U+FFFD generation when validating UTF-8 by state 
machine.

It *really* doesn’t.  Even if you’re hell bent on using a pure state machine 
approach, you need to add maybe two additional error states 
(two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on 
top of the states you already have.  The implementation complexity argument is 
a *total* red herring.


   Heh. A state machine with N+2 states is, /a fortiori/, more complex 
than one with N states. So I think your argument is self-contradictory.

Alastair.

～ʝ

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-21 Thread Mark Davis ☕️ via Unicode

I actually didn't see any of this discussion until today. (
unicode@unicode.org mail was going into my spam folder...) I started
reading the thread, but it looks like a lot of it is OT, so just scanned
some of them.

A few brief points:

   1. There is plenty of time for public comment, since it was
targeted at *Unicode
   11*, the release for about a year from now, *not* *Unicode 10*, due this
   year.
   2. When the UTC "approves a change", that change is subject to comment,
   and the UTC can always reverse or modify its approval up until the meeting
   before release date. *So there are ca. 9 months in which to comment.*
   3. The modified text is a set of guidelines, not requirements. So no
   conformance clause is being changed.
   - If people really believed that the guidelines in that section should
  have been conformance clauses, they should have proposed that at
some point.
  - And still can proposal that — as I said, there is plenty of time.


Mark

On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode <
unicode@unicode.org> wrote:

> Henri Sivonen wrote:
>
> > I find it shocking that the Unicode Consortium would change a
> > widely-implemented part of the standard (regardless of whether Unicode
> > itself officially designates it as a requirement or suggestion) on
> > such flimsy grounds.
> >
> > I'd like to register my feedback that I believe changing the best
> > practices is wrong.
>
> Perhaps surprisingly, it's already too late. UTC approved this change
> the day after the proposal was written.
>
> http://www.unicode.org/L2/L2017/17103.htm#151-C19
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Richard Wordingham via Unicode

On Thu, 18 May 2017 09:58:43 +0100
Alastair Houghton via Unicode  wrote:

> On 18 May 2017, at 07:18, Henri Sivonen via Unicode
>  wrote:
> > 
> > the decision complicates U+FFFD generation when validating UTF-8 by
> > state machine.  
> 
> It *really* doesn’t.  Even if you’re hell bent on using a pure state
> machine approach, you need to add maybe two additional error states
> (two-trailing-bytes-to-eat-then-fffd and
> one-trailing-byte-to-eat-then-fffd) on top of the states you already
> have.  The implementation complexity argument is a *total* red
> herring.

For big programs, yes.  However, for a small program it can be
attractive to have a small hand-coded routine so that the source code
can sit in a single file.  It can even allow a basically UTF-8 program
to meet a requirement to be able to match lone surrogates in a regular
expression, as was once required.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode

On 18 May 2017, at 07:18, Henri Sivonen via Unicode  wrote:
> 
> the decision complicates U+FFFD generation when validating UTF-8 by state 
> machine.

It *really* doesn’t.  Even if you’re hell bent on using a pure state machine 
approach, you need to add maybe two additional error states 
(two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on 
top of the states you already have.  The implementation complexity argument is 
a *total* red herring.

> 2) Procedural: To be considered in the future, proposals to change
> what the standard suggests or requires implementations to do should
> consider different implementation strategies and discuss the impact of
> the change in the light of the different implementation strategies (in
> the matter at hand, I think the proposal should have included a
> discussion of the impact on UTF-8 validation state machines)

Well, let’s discuss that here and now (see above).  Do you, for some reason, 
think that it’s more complicated than I suggest?

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Hans Åberg via Unicode


> On 16 May 2017, at 15:21, Richard Wordingham via Unicode 
>  wrote:
> 
> On Tue, 16 May 2017 14:44:44 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode
>>>  wrote:  
>> ...
>>> I think Unicode should not adopt the proposed change.  
>> 
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original octet
>> sequence can be restored.
> 
> Escape sequences for the inappropriate bytes is the natural technique.
> Your problem is smoothly transitioning so that the escape character is
> always escaped when it means itself. Strictly, it can't be done.
> 
> Of course, some sequences of escaped characters should be prohibited.
> Checking could be fiddly.

One could write the bytes using \xnn escape codes, sequences terminated using 
\& as in Haskell, translating '\' into "\\". It then becomes a C-encoded 
string, not plain text.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode

On 18 May 2017, at 06:01, Richard Wordingham via Unicode  
wrote:
> 
> On Thu, 18 May 2017 02:04:55 +0200
> Philippe Verdy via Unicode  wrote:
> 
>> I find intriguating that the update intends to enforce the decoding
>> of the **shortest** sequences, but now wants to treat **maximal
>> sequences** as a single unit with arbitrary length. UTF-8 was
>> designed to work only with some state machines that would NEVER need
>> to parse more than 4 bytes.
> 
> If you look at the sample code in
> http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that
> it's working with 6-byte sequences.  It's the Unicode, as opposed to
> ISO 10646, version that has always been restricted to 4 bytes.

There are good reasons for restricting it to four byte sequences, mind; doing 
so increases the number of invalid code units, which makes it easier to detect 
UTF-8 versus not UTF-8.  I don’t think anyone is proposing allowing 5-byte or 
6-byte sequences.

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode

On 18 May 2017, at 01:04, Philippe Verdy via Unicode  
wrote:
> 
> I find intriguating that the update intends to enforce the decoding of the 
> **shortest** sequences, but now wants to treat **maximal sequences** as a 
> single unit with arbitrary length. UTF-8 was designed to work only with some 
> state machines that would NEVER need to parse more than 4 bytes.

This won’t change.  You still don’t need to parse more than four bytes.  In 
fact, you don’t need to do *anything*, even if your implementation doesn’t 
match the proposal, because *it’s only a recommendation*.  But if you did 
choose to do something, you *still* don’t need to scan arbitrary numbers of 
bytes.

> For me, as soon as the first byte encountered is invalid, the current 
> sequence should be stopped there and treated as error (replaced by U+FFFD is 
> replacement is enabled instead of returning an error or throwing an 
> exception),

This is still essentially true under the proposal; the only difference is that 
instead of being a clever dick and taking account of the valid *code point* 
ranges while doing this in order to ban certain trailing bytes given the values 
of their predecessors, you allow any trailing byte, and only worry about 
whether the complete sequence represents a valid code point or is over-long 
once you’ve finished reading it.  You never need to read more than four bytes 
under the new proposal, because the lead byte tells you how many to expect, and 
you’d still stop and instantly replace with U+FFFD if you see a byte outside 
the 0x80-0xbf range, even if you hadn’t scanned the number of bytes the lead 
byte says to expect.

This also *does not* change the view of the underlying UTF-8 string based on 
iteration direction; you would still generate the exact same sequence of code 
points in both directions.

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Henri Sivonen via Unicode

On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode
 wrote:
> On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote:
>
> There's some sort of rule that proposals should be made seven days in
> advance of the meeting.  I can't find it now, so I'm not sure whether
> the actual rule was followed, let alone what authority it has.
>
> Ideally, proposals that update algorithms or properties of some significance
> should be required to be reviewed in more than one pass. The procedures of
> the UTC are a bit weak in that respect, at least compared to other standards
> organizations. The PRI process addresses that issue to some extent.

What action should I take to make proposals to be considered by the UTC?

I'd like to make two:

 1) Substantive: Reverse the decision to modify U+FFFD best practice
when decoding UTF-8. (I think the decision lacked a truly compelling
reason to change something that has a number of prominent
implementations and the decision complicates U+FFFD generation when
validating UTF-8 by state machine. Aesthetic considerations in error
handling shouldn't outweigh multiple prominent implementations and
shouldn't introduce implementation complexity.)

 2) Procedural: To be considered in the future, proposals to change
what the standard suggests or requires implementations to do should
consider different implementation strategies and discuss the impact of
the change in the light of the different implementation strategies (in
the matter at hand, I think the proposal should have included a
discussion of the impact on UTF-8 validation state machines) and
should include a review of what prominent implementations, including
major browser engines, operating system libraries, and standard
libraries of well-known programming languages, already do. (The more
established the presently specced behavior is among prominent
implementations, the more compelling reason should be required to
change the spec. An implementation hosted by the Consortium itself
shouldn't have special weight compared to other prominent
implementations.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Thu, 18 May 2017 02:04:55 +0200
Philippe Verdy via Unicode  wrote:

> I find intriguating that the update intends to enforce the decoding
> of the **shortest** sequences, but now wants to treat **maximal
> sequences** as a single unit with arbitrary length. UTF-8 was
> designed to work only with some state machines that would NEVER need
> to parse more than 4 bytes.

If you look at the sample code in
http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that
it's working with 6-byte sequences.  It's the Unicode, as opposed to
ISO 10646, version that has always been restricted to 4 bytes.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8


Richard Wordingham wrote:


I'm afraid I don't get the analogy.


You can't build a full Unicode system out of Unicode-compliant parts.


Others will have to address Richard's point about canonical-equivalent 
sequences.



However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
(in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
critical wording, "When converting from UTF-8 to Unicode values,
however, implementations do not need to check that the shortest
encoding is being used,...". There was no prohibition on
implementations performing the check, so whether C0 80 would be
interpreted as U+ or as an error was unpredictable.


So it is as I said, and as TUS said before Corrigendum #1 was approved, 
more than 16 years ago: It was not legal to create overlong sequences, 
but implementations were allowed to interpret any that they came across.


As someone who pays attention to the fine details, you will certainly 
appreciate the difference between "it was once legal to encode NUL as E0 
80 80" and "it was once legal for a decoder to interpret the sequence E0 
80 80 as NUL instead of rejecting it."


--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Philippe Verdy via Unicode

I find intriguating that the update intends to enforce the decoding of the
**shortest** sequences, but now wants to treat **maximal sequences** as a
single unit with arbitrary length. UTF-8 was designed to work only with
some state machines that would NEVER need to parse more than 4 bytes.

For me, as soon as the first byte encountered is invalid, the current
sequence should be stopped there and treated as error (replaced by U+FFFD
is replacement is enabled instead of returning an error or throwing an
exception), and then any further trailing byte should be treated isolated
as an error: The number of returned U+FFFD replacements would then be the
same when you scan the input forward or backward without **ever** reading
more than 4 bytes in all directions (this is a problem when the parseing
will reach an end of buffer where you'll block on performing I/O to read
the previous or next block, and managing a cache of multiple blocks (more
than 2) is a problem with this unexpected change that will create new
performance problems and add new memory constraints (in adition to new
possible attacks if that parser needs to keep multiple buffers in memorty
instead of treating them individually with a single overhead buffer, and
throwing away the individual buffers on the fly as soon as they are
indivisually fully parsed).

2017-05-18 1:41 GMT+02:00 Asmus Freytag via Unicode :

> On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote:
>
> There's some sort of rule that proposals should be made seven days in
> advance of the meeting.  I can't find it now, so I'm not sure whether
> the actual rule was followed, let alone what authority it has.
>
> Ideally, proposals that update algorithms or properties of some
> significance should be required to be reviewed in more than one pass. The
> procedures of the UTC are a bit weak in that respect, at least compared to
> other standards organizations. The PRI process addresses that issue to some
> extent.
>
> A./
>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Asmus Freytag via Unicode


  
  
On 5/17/2017 2:31 PM, Richard
  Wordingham via Unicode wrote:


  There's some sort of rule that proposals should be made seven days in
advance of the meeting.  I can't find it now, so I'm not sure whether
the actual rule was followed, let alone what authority it has.

Ideally, proposals that update algorithms or
properties of some significance should be required to be reviewed
in more than one pass. The procedures of the UTC are a bit weak
in that respect, at least compared to other standards
organizations. The PRI process addresses that issue to some
extent.
A./

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Wed, 17 May 2017 15:31:56 -0700
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> > So it was still a legal way for a non-UTF-8-compliant process!  
> 
> Anything is possible if you are non-compliant. You can encode U+263A
> with 9,786 FF bytes followed by a terminating FE byte and call that
> "UTF-8," if you are willing to be non-compliant enough.
> 
> > Note for example that a compliant implementation of full
> > upper-casing shall convert the canonically equivalent strings
> >  > COMBINING COMMA ABOVE> and  > PSILI, U+0345 COMBINING GREEK  
> > YPOGEGRAMMENI> to the canonically inequivalent strings  > YPOGEGRAMMENI> GREEK  
> > CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and
> >  > LETTER IOTA>. A compliant Unicode process may not assume that this
> > is the right thing to do. (Or are some compliant Unicode processes
> > required to incorrectly believe that they are doing something they
> > mustn't do?)  
> 
> I'm afraid I don't get the analogy.

You can't build a full Unicode system out of Unicode-compliant parts.

However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
(in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
critical wording, "When converting from UTF-8 to Unicode values,
however, implementations do not need to check that the shortest
encoding is being used,...".  There was no prohibition on
implementations performing the check, so whether C0 80 would be
interpreted as U+ or as an error was unpredictable.

Richard.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Richard Wordingham wrote:

> So it was still a legal way for a non-UTF-8-compliant process!

Anything is possible if you are non-compliant. You can encode U+263A
with 9,786 FF bytes followed by a terminating FE byte and call that
"UTF-8," if you are willing to be non-compliant enough.

> Note for example that a compliant implementation of full upper-casing
> shall convert the canonically equivalent strings  LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE> and
>  YPOGEGRAMMENI> to the canonically inequivalent strings  CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and
>  LETTER IOTA>. A compliant Unicode process may not assume that this is
> the right thing to do. (Or are some compliant Unicode processes
> required to incorrectly believe that they are doing something they
> mustn't do?)

I'm afraid I don't get the analogy.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Wed, 17 May 2017 13:37:51 -0700
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> >> It is not at all clear what the intent of the encoder was - or even
> >> if it's not just a problem with the data stream. E0 80 80 is not
> >> permitted, it's garbage. An encoder can't "intend" it.  
> >
> > It was once a legal way of encoding NUL, just like C0 E0, which is
> > still in use, and seems to be the best way of storing NUL as
> > character content in a *C string*.  
> 
> I wish I had a penny for every time I'd seen this urban legend.
> 
> At http://doc.cat-v.org/bell_labs/utf-8_history you can read the
> original definition of UTF-8, from Ken Thompson on 1992-09-08, so long
> ago that it was still called FSS-UTF:
> 
> "When there are multiple ways to encode a value, for example
> UCS 0, only the shortest encoding is legal."
> 
> Unicode once permitted implementations to *decode* non-shortest forms,
> but never allowed an implementation to *create* them
> (http://www.unicode.org/versions/corrigendum1.html):
> 
> "For example, UTF-8 allows nonshortest code value sequences to be
> interpreted: a UTF-8 conformant may map the code value sequence C0 80
> (1100₂ 1000₂) to the Unicode value U+, even though a
> UTF-8 conformant process shall never generate that code value sequence
> -- it shall generate the sequence 00 (₂) instead."
> 
> This was the passage that was deleted as part of Corrigendum #1.

So it was still a legal way for a non-UTF-8-compliant process!  Note
for example that a compliant implementation of full upper-casing
shall convert the canonically equivalent strings  and
  to the canonically inequivalent strings  and
.  A compliant Unicode process may not assume that this is
the right thing to do.  (Or are some compliant Unicode processes
required to incorrectly believe that they are doing something they
mustn't do?)

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Wed, 17 May 2017 13:41:56 -0700
Doug Ewell via Unicode  wrote:

> Perhaps surprisingly, it's already too late. UTC approved this change
> the day after the proposal was written.
> 
> http://www.unicode.org/L2/L2017/17103.htm#151-C19

Approved for Unicode 11.0.  Unicode 10.0 has yet to be released.  The
change may still be rescinded.

There's some sort of rule that proposals should be made seven days in
advance of the meeting.  I can't find it now, so I'm not sure whether
the actual rule was followed, let alone what authority it has.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode


> On 17 May 2017, at 23:18, Doug Ewell  wrote:
> 
> Hans Åberg wrote:
> 
>>> Far from solving the stated problem, it would introduce a new one:
>>> conversion from the "bad data" Unicode code points, currently
>>> well-defined, would become ambiguous.
>> 
>> Actually not: just translate the invalid UTF-8 sequences into invalid
>> UTF-32.
> 
> Far from solving the stated problem, it would introduce TWO new ones...

There is no good solution to the problem of illegal UTF-8 sequences, as the 
intent of those is not known.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Hans Åberg wrote:

>> Far from solving the stated problem, it would introduce a new one:
>> conversion from the "bad data" Unicode code points, currently
>> well-defined, would become ambiguous.
>
> Actually not: just translate the invalid UTF-8 sequences into invalid
> UTF-32.

Far from solving the stated problem, it would introduce TWO new ones...
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode

> On 17 May 2017, at 22:36, Doug Ewell via Unicode  wrote:
> 
> Hans Åberg wrote:
> 
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequence can be restored. 
> 
> I have always argued strongly against this idea, and always will.
> 
> Far from solving the stated problem, it would introduce a new one:
> conversion from the "bad data" Unicode code points, currently
> well-defined, would become ambiguous.

Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. 
No Unicode extensions are needed, as it has no say about what to happen with 
what it considers invalid.

> File systems cannot have it both ways: they must define file names
> either as unrestricted sequences of bytes, or as strings of characters
> in some defined encoding. If they choose the latter, they need to define
> conversion mechanisms with suitable fallback and adhere to them. They
> can use the PUA if they like. 

The latter is complicated, so that is not what one does I am told, with some 
exception. Also, one may end up with a file in an unknown encoding, say 
imported remotely, and then the OS cannot deal with it.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Henri Sivonen wrote:

> I find it shocking that the Unicode Consortium would change a
> widely-implemented part of the standard (regardless of whether Unicode
> itself officially designates it as a requirement or suggestion) on
> such flimsy grounds.
>
> I'd like to register my feedback that I believe changing the best
> practices is wrong.

Perhaps surprisingly, it's already too late. UTC approved this change
the day after the proposal was written.

http://www.unicode.org/L2/L2017/17103.htm#151-C19
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Richard Wordingham wrote:

>> It is not at all clear what the intent of the encoder was - or even
>> if it's not just a problem with the data stream. E0 80 80 is not
>> permitted, it's garbage. An encoder can't "intend" it.
>
> It was once a legal way of encoding NUL, just like C0 E0, which is
> still in use, and seems to be the best way of storing NUL as character
> content in a *C string*.

I wish I had a penny for every time I'd seen this urban legend.

At http://doc.cat-v.org/bell_labs/utf-8_history you can read the
original definition of UTF-8, from Ken Thompson on 1992-09-08, so long
ago that it was still called FSS-UTF:

"When there are multiple ways to encode a value, for example
UCS 0, only the shortest encoding is legal."

Unicode once permitted implementations to *decode* non-shortest forms,
but never allowed an implementation to *create* them
(http://www.unicode.org/versions/corrigendum1.html):

"For example, UTF-8 allows nonshortest code value sequences to be
interpreted: a UTF-8 conformant may map the code value sequence C0 80
(1100₂ 1000₂) to the Unicode value U+, even though a
UTF-8 conformant process shall never generate that code value sequence
-- it shall generate the sequence 00 (₂) instead."

This was the passage that was deleted as part of Corrigendum #1.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Hans Åberg wrote:

> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original
> octet sequence can be restored. 

I have always argued strongly against this idea, and always will.

Far from solving the stated problem, it would introduce a new one:
conversion from the "bad data" Unicode code points, currently
well-defined, would become ambiguous.

Suppose the block U+EFFxx were assigned to invalid UTF-8 bytes .
Then there would be two possible conversions from, for instance,
U+EFF80: either <80> or .

Declaring the "special" code points to be excluded from straightforward
UTF-* conversion would invalidate every existing UTF-* processor, and
would be widely ignored.

File systems cannot have it both ways: they must define file names
either as unrestricted sequences of bytes, or as strings of characters
in some defined encoding. If they choose the latter, they need to define
conversion mechanisms with suitable fallback and adhere to them. They
can use the PUA if they like. 

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Alastair Houghton via Unicode


> On 16 May 2017, at 20:43, Richard Wordingham via Unicode 
>  wrote:
> 
> On Tue, 16 May 2017 11:36:39 -0700
> Markus Scherer via Unicode  wrote:
> 
>> Why do we care how we carve up an illegal sequence into subsequences?
>> Only for debugging and visual inspection. Maybe some process is using
>> illegal, overlong sequences to encode something special (à la Java
>> string serialization, "modified UTF-8"), and for that it might be
>> convenient too to treat overlong sequences as single errors.
> 
> I think that's not quite true.  If we are moving back and forth through
> a buffer containing corrupt text, we need to make sure that moving three
> characters forward and then three characters back leaves us where we
> started.  That requires internal consistency.

That’s very true.  But the proposed change doesn’t actually affect that; it’s 
still the case that you can correctly identify boundaries in both directions.

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Henri Sivonen via Unicode

On Tue, May 16, 2017 at 9:36 PM, Markus Scherer  wrote:
> Let me try to address some of the issues raised here.

Thank you.

> The proposal changes a recommendation, not a requirement.

This is a very bad reason in favor of the change. If anything, this
should be a reason why there is no need to change the spec text.

> Conformance
> applies to finding and interpreting valid sequences properly. This includes
> not consuming parts of valid sequences when dealing with illegal ones, as
> explained in the section "Constraints on Conversion Processes".
>
> Otherwise, what you do with illegal sequences is a matter of what you think
> makes sense -- a matter of opinion and convenience. Nothing more.

This may be the Unicode-level view of error handling. It isn't the
Web-level view of error handling. In the world of Web standards (i.e.
standards that read on the behavior of browsers engines), we've
learned that implementation-defined behavior is bad, because someone
makes a popular site that depends on the implementation-defined
behavior of the browser they happened to test in. For this reason, the
WHATWG has since 2004 written specs that are well-defined even in
corner cases and for non-conforming input, and we've tried to extend
this culture into the W3C, too. (Sometimes, exceptions are made when
there's a very good reason to handle a corner case differently in a
given implementatino: A recent example is CSS allowing the
non-preservation of lone surrogates entering the CSS Object Model via
JavaScript strings in order to enable CSS Object Model implementations
that use UTF-8 [really UTF-8 and not some almost-UTF-8 variant]
internally. But, yes, we really do sweat the details on that level.)

Even if one could argue that implementation-defined behavior on the
topic of number of U+FFFDs for ill-formed sequences in UTF-8 decode
doesn't matter, the WHATWG way of doing things isn't to debate whether
implementation-defined behavior matters in this particular case but to
require one particular behavior in order to have well-defined behavior
even when input is non-conforming.

It further seems that there are people who do care about what's a
*requirement* on the WHATWG level matching what's "best practice" on
the Unicode level:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=19938

Now that major browsers agree, knowing what I know about how the
WHATWG operates, while I can't speak for Anne, I expect the WHATWG
spec to say as-is, because it now matches the browser consensus.

So as a practical matter, if Unicode now changes its "best practice",
when people check consistency with Unicode-level "best practice" and
notice a discrepancy, the WHATWG and developers of implementations
that took the previously-stated "best practice" seriously (either
directly or by the means of another spec, like the WHATWG Encoding
Standard, elevating it to a *requirement*) will need to explain why
they don't follow the best practice.

It is really inappropriate to inflict that trouble onto pretty much
everyone except ICU when the rationale for change is as flimsy as
"feels right". And, as noted earlier, politically it looks *really
bad* for Unicode to change its own previous recommendation to side
with ICU not following it when a number of other prominent
implementations do.

> I believe that the discussion of how to handle illegal sequences came out of
> security issues a few years ago from some implementations including valid
> single and lead bytes with preceding illegal sequences.
...
> Why do we care how we carve up an illegal sequence into subsequences? Only
> for debugging and visual inspection.
...
> If you don't like some recommendation, then do something else. It does not
> matter. If you don't reject the whole input but instead choose to replace
> illegal sequences with something, then make sure the something is not
> nothing -- replacing with an empty string can cause security issues.
> Otherwise, what the something is, or how many of them you put in, is not
> very relevant. One or more U+FFFDs is customary.

When the recommendation came about for security reasons, it's a really
bad idea that to suggest that implementors should decide on their own
what to do and trust that their decision deviates little enough from
the suggestion to stay on the secure side. To be clear, I'm not, at
this time, claiming that the number of U+FFFDs has a security
consequence as long as the number is at least one, but there's an
awfully short slippery slope to giving the caller of a converter API
the option to "ignore errors", i.e. make the number zero, which *is*,
as you note, a security problem.

> When the current recommendation came in, I thought it was reasonable but
> didn't like the edge cases. At the time, I didn't think it was important to
> twiddle with the text in the standard, and I didn't care that ICU didn't
> exactly implement that particular recommendation.

If ICU doesn't care, then it should be ICU developers and

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Another alternative for you API is to not return simple integer values, but
return (read-only) instances of a Char32 class whose "scalar" property
would normally be a valid codepoint with scalar value, or whose "string"
property will be the actual character; but with another static property
"isValidScalar" returning "true"; for other ill-formed
sequences,"isValidScalar" will be false, the scalar value will be the
initial code unit from the input (decoded from the internal representation
in tyhe backing store) and the "string" property will be empty. You may
also add a special "Char32" static instance representing
end-of-file/end-of-string, whose property "isEOF" will be true, and
property scalar will be typically -1, "isValid Scalar" will be false, and
the "string" property will be the empty string.

All this is possible independantly of the internal representation made in
the backing store for its own code units (where it may use any extension of
standard UTF's or any data compression scheme without exposing it)

2017-05-16 23:08 GMT+02:00 Philippe Verdy :

>
>
> 2017-05-16 20:50 GMT+02:00 Shawn Steele :
>
>> But why change a recommendation just because it “feels like”.  As you
>> said, it’s just a recommendation, so if that really annoyed someone, they
>> could do something else (eg: they could use a single FFFD).
>>
>>
>>
>> If the recommendation is truly that meaningless or arbitrary, then we
>> just get into silly discussions of “better” that nobody can really answer.
>>
>>
>>
>> Alternatively, how about “one or more FFFDs?” for the recommendation?
>>
>>
>>
>> To me it feels very odd to perhaps require writing extra code to detect
>> an illegal case.  The “best practice” here should maybe be “one or more
>> FFFDs, whatever makes your code faster”.
>>
>
> Faster ok, privided this does not break other uses, notably for  random
> access within strings, where UTF-8 is designed to allow searching backward
> on a limited number of bytes (maximum 3) in order to find the leading byte,
> and then check its value:
> - if it's not found, return back to the initial position and amke the next
> access return U+FFFD to signal the error of position: this trailing byte is
> part of an ill-formed sequence, and for coherence, any further trailine
> bytes fouind after it will **also** return U+FFFD to be coherent (because
> these other trailing bytes may also be found bby random access to them.
> - it the leading byte is found backward ut does not match the expected
> number of trailing bytes after it, return back to the initial random
> position where you'll return also U+FFFD. This means that the initial
> leading byte (part of the ill-formed sequence) must also return a separate
> U+FFFD, given that each following trailing byte will return U+FFFD
> isolately when accessing to them.
>
> If we want coherent decoding with text handling promitives allowing random
> access with encoded sequences, there's no other choice than treating EACH
> byte part of the ill-formed sequence as individual errors mapped to the
> same replacement code point (U+FFFD if that is what is chosen, but these
> APIs could as well specify annother replacement character or could
> eventually return a non-codepoint if the API return value is not restricted
> to only valid codepoints (for example the replacement could be a negative
> value whose absolute value matches the invalid code unit, or some other
> invalid code unit outside the valid range for code points with scalar
> values: isolated surrogates in UTF-16 for example could be returned as is,
> or made negative either by returning its opposite or by setting (or'ing)
> the most significant bit of the return value).
>
> The problem will arise when you need to store the replacement values if
> the internal backing store is limited to 16-bit code units or 8-bit code
> units: this internal backing store may use its own internal extension of
> standard UTF's, including the possibility of encoding NULLs as C0,80 (like
> what Java does with its "modified UTF-8 internal encoding used in its
> compiled binary classes and serializations), or internally using isolated
> trailing surrogates to store illformed UTF-8 input by or'ing these bytes
> with 0xDC00 that will be returned as an code point with no valid scalar
> value. For internally representing illformed UTF-16 sequences, there's no
> need to change anything. For internally representing ill-formed UTF-32
> sequences (in fact limited to one 32-bitcode unit), with a 16bit internal
> backing store you may need to store 3 16bit values (three isolated trailing
> surrogates). For internally representing ill formed UTF-32 in an 8 bit
> backing store, you could use 0xC1 followed by 5 five trailing bytes (each
> one storing 7 bits of the initial ill-formed code unit from the UTF-32
> input).
>
> What you'll do in the internal backing store will not be exposed to your
> API which will just return either valide

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

> Faster ok, privided this does not break other uses, notably for  random 
> access within strings…

Either way, this is a “recommendation”.  I don’t see how that can provide for 
not-“breaking other uses.”  If it’s internal, you can do what you will, so if 
you need the 1:1 seeming parity, then you can do that internally.  But if 
you’re depending on other APIs/libraries/data source/whatever, it would seem 
like you couldn’t count on that.  (And probably shouldn’t even if it was a 
requirement rather than a recommendation).

I’m wary of the idea of attempting random access on a stream that is also 
manipulating the stream at the same time (decoding apparently).

The U+FFFD emitted by this decoding could also require a different # of bytes 
to reencode.  Which might disrupt the presumed parity, depending on how the 
data access was being handled.

-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 20:50 GMT+02:00 Shawn Steele :

> But why change a recommendation just because it “feels like”.  As you
> said, it’s just a recommendation, so if that really annoyed someone, they
> could do something else (eg: they could use a single FFFD).
>
>
>
> If the recommendation is truly that meaningless or arbitrary, then we just
> get into silly discussions of “better” that nobody can really answer.
>
>
>
> Alternatively, how about “one or more FFFDs?” for the recommendation?
>
>
>
> To me it feels very odd to perhaps require writing extra code to detect an
> illegal case.  The “best practice” here should maybe be “one or more FFFDs,
> whatever makes your code faster”.
>

Faster ok, privided this does not break other uses, notably for  random
access within strings, where UTF-8 is designed to allow searching backward
on a limited number of bytes (maximum 3) in order to find the leading byte,
and then check its value:
- if it's not found, return back to the initial position and amke the next
access return U+FFFD to signal the error of position: this trailing byte is
part of an ill-formed sequence, and for coherence, any further trailine
bytes fouind after it will **also** return U+FFFD to be coherent (because
these other trailing bytes may also be found bby random access to them.
- it the leading byte is found backward ut does not match the expected
number of trailing bytes after it, return back to the initial random
position where you'll return also U+FFFD. This means that the initial
leading byte (part of the ill-formed sequence) must also return a separate
U+FFFD, given that each following trailing byte will return U+FFFD
isolately when accessing to them.

If we want coherent decoding with text handling promitives allowing random
access with encoded sequences, there's no other choice than treating EACH
byte part of the ill-formed sequence as individual errors mapped to the
same replacement code point (U+FFFD if that is what is chosen, but these
APIs could as well specify annother replacement character or could
eventually return a non-codepoint if the API return value is not restricted
to only valid codepoints (for example the replacement could be a negative
value whose absolute value matches the invalid code unit, or some other
invalid code unit outside the valid range for code points with scalar
values: isolated surrogates in UTF-16 for example could be returned as is,
or made negative either by returning its opposite or by setting (or'ing)
the most significant bit of the return value).

The problem will arise when you need to store the replacement values if the
internal backing store is limited to 16-bit code units or 8-bit code units:
this internal backing store may use its own internal extension of standard
UTF's, including the possibility of encoding NULLs as C0,80 (like what Java
does with its "modified UTF-8 internal encoding used in its compiled binary
classes and serializations), or internally using isolated trailing
surrogates to store illformed UTF-8 input by or'ing these bytes with 0xDC00
that will be returned as an code point with no valid scalar value. For
internally representing illformed UTF-16 sequences, there's no need to
change anything. For internally representing ill-formed UTF-32 sequences
(in fact limited to one 32-bitcode unit), with a 16bit internal backing
store you may need to store 3 16bit values (three isolated trailing
surrogates). For internally representing ill formed UTF-32 in an 8 bit
backing store, you could use 0xC1 followed by 5 five trailing bytes (each
one storing 7 bits of the initial ill-formed code unit from the UTF-32
input).

What you'll do in the internal backing store will not be exposed to your
API which will just return either valide codepoints with valid scalar
values, or values outside the two valid subranges (so it could possibly
negative values, or isolated trailing surrogates). That backing store can
also substitute some valid input causing problems (such as NULLs) using
0xC0 plus another byte, that sequence being unexposed by your API which
will still be able to return the expected codepoints (but with the minor
caveat that the total number of returned codepoints will not match the
actual size allocated for the internal backing store (that applications
using that API won't even need to know how it is internally represented).

In other words: any private extensions are possible internally, but it is
possible to isolate it within a blackboxing API which will still be able to
chose how to represent the input text (it may as well use a zlib-compressed
backing store, or some stateless Huffmann compression based on a static
statistic table configured and stored elsewhere, intiialized when you first
instantiate your API).

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode

On Tue, 16 May 2017 11:36:39 -0700
Markus Scherer via Unicode  wrote:

> Why do we care how we carve up an illegal sequence into subsequences?
> Only for debugging and visual inspection. Maybe some process is using
> illegal, overlong sequences to encode something special (à la Java
> string serialization, "modified UTF-8"), and for that it might be
> convenient too to treat overlong sequences as single errors.

I think that's not quite true.  If we are moving back and forth through
a buffer containing corrupt text, we need to make sure that moving three
characters forward and then three characters back leaves us where we
started.  That requires internal consistency.

One possible issue is with text input methods that access an
application's backing store.  They can issue updates in the form of
'delete 3 characters and insert ...'.  However, if the input method is
accessing characters it hasn't written, it's probably misbehaving
anyway.  Such commands do rather heavily assume that any
relevant normalisation by the application will be taken into account by
the input method.  I once had a go at fixing an application that was
misinterpreting 'delete x characters' as 'delete x UTF-16 code units'.
It was a horrible mess, as the application's interface layer couldn't
peek at the string being edited.

Richard.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

But why change a recommendation just because it “feels like”.  As you said, 
it’s just a recommendation, so if that really annoyed someone, they could do 
something else (eg: they could use a single FFFD).

If the recommendation is truly that meaningless or arbitrary, then we just get 
into silly discussions of “better” that nobody can really answer.

Alternatively, how about “one or more FFFDs?” for the recommendation?

To me it feels very odd to perhaps require writing extra code to detect an 
illegal case.  The “best practice” here should maybe be “one or more FFFDs, 
whatever makes your code faster”.

Best practices may not be requirements, but people will still take time to file 
bugs that something isn’t following a “best practice”.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer 
via Unicode
Sent: Tuesday, May 16, 2017 11:37 AM
To: Alastair Houghton <alast...@alastairs-place.net>
Cc: Philippe Verdy <verd...@wanadoo.fr>; Henri Sivonen <hsivo...@hsivonen.fi>; 
unicode Unicode Discussion <unicode@unicode.org>; Hans Åberg 
<haber...@telia.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies 
to finding and interpreting valid sequences properly. This includes not 
consuming parts of valid sequences when dealing with illegal ones, as explained 
in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think 
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU 
team. At the time, I believe the ISO UTF-8 definition was not yet limited to 
U+10, and decoding overlong sequences and those yielding surrogate code 
points was regarded as a misdemeanor. The spec has been tightened up, but I am 
pretty sure that most people familiar with how UTF-8 came about would recognize 
 and  as single sequences.

I believe that the discussion of how to handle illegal sequences came out of 
security issues a few years ago from some implementations including valid 
single and lead bytes with preceding illegal sequences. Beyond the "Constraints 
on Conversion Processes", there was evidently also a desire to recommend how to 
handle illegal sequences.

I think that the current recommendation was an extrapolation of common practice 
for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, 
but "it feels like" (yes, that's the level of argument for stuff that doesn't 
really matter) not treating  and  as single sequences is 
"weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for 
debugging and visual inspection. Maybe some process is using illegal, overlong 
sequences to encode something special (à la Java string serialization, 
"modified UTF-8"), and for that it might be convenient too to treat overlong 
sequences as single errors.

If you don't like some recommendation, then do something else. It does not 
matter. If you don't reject the whole input but instead choose to replace 
illegal sequences with something, then make sure the something is not nothing 
-- replacing with an empty string can cause security issues. Otherwise, what 
the something is, or how many of them you put in, is not very relevant. One or 
more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but didn't 
like the edge cases. At the time, I didn't think it was important to twiddle 
with the text in the standard, and I didn't care that ICU didn't exactly 
implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with 
a space, because it's easier than writing an U+FFFD for each byte or for some 
subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long 
illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best 
practices are wrong." I think "wrong" is far too strong, but I got an action 
item to propose a change in the text. I proposed a modified recommendation. 
Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" 
that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a wider 
set of sequences, but a capable implementer will optimize successfully for 
valid sequences, and maybe even for a subset of those for what might be 
expected high-frequency code point ranges. Error handling can go into a slow 
path. In a true state table implementation, it will require more states but 
should not affect the p

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode

On 16 May 2017, at 19:36, Markus Scherer  wrote:
> 
> Let me try to address some of the issues raised here.

Thanks for jumping in.

The one thing I wanted to ask about was the “without ever restricting trail 
bytes to less than 80..BF”.  I think that could be misinterpreted; having 
thought about it some more, I think you mean “considering any trailing byte in 
the range 80..BF as valid”.  The “less than” threw me the first few times I 
read it and I started thinking you meant allowing any byte as a trailing byte, 
which is clearly not right.

Otherwise, I’m happy :-)

Kind regards,

Alastair.

--
http://alastairs-place.net

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance
applies to finding and interpreting valid sequences properly. This includes
not consuming parts of valid sequences when dealing with illegal ones, as
explained in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the
ICU team. At the time, I believe the ISO UTF-8 definition was not yet
limited to U+10, and decoding overlong sequences and those yielding
surrogate code points was regarded as a misdemeanor. The spec has been
tightened up, but I am pretty sure that most people familiar with how UTF-8
came about would recognize  and  as single sequences.

I believe that the discussion of how to handle illegal sequences came out
of security issues a few years ago from some implementations including
valid single and lead bytes with preceding illegal sequences. Beyond the
"Constraints on Conversion Processes", there was evidently also a desire to
recommend how to handle illegal sequences.

I think that the current recommendation was an extrapolation of common
practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for
UTF-8, too, but "it feels like" (yes, that's the level of argument for
stuff that doesn't really matter) not treating  and  as
single sequences is "weird".

Why do we care how we carve up an illegal sequence into subsequences? Only
for debugging and visual inspection. Maybe some process is using illegal,
overlong sequences to encode something special (à la Java string
serialization, "modified UTF-8"), and for that it might be convenient too
to treat overlong sequences as single errors.

If you don't like some recommendation, then do something else. It does not
matter. If you don't reject the whole input but instead choose to replace
illegal sequences with something, then make sure the something is not
nothing -- replacing with an empty string can cause security issues.
Otherwise, what the something is, or how many of them you put in, is not
very relevant. One or more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but
didn't like the edge cases. At the time, I didn't think it was important to
twiddle with the text in the standard, and I didn't care that ICU didn't
exactly implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence
with a space, because it's easier than writing an U+FFFD for each byte or
for some subsequences. Fine. Someone might write a single U+FFFD for an
arbitrarily long illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best
practices are wrong." I think "wrong" is far too strong, but I got an
action item to propose a change in the text. I proposed a modified
recommendation. Nothing gets elevated to "right" that wasn't, nothing gets
demoted to "wrong" that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a
wider set of sequences, but a capable implementer will optimize
successfully for valid sequences, and maybe even for a subset of those for
what might be expected high-frequency code point ranges. Error handling can
go into a slow path. In a true state table implementation, it will require
more states but should not affect the performance of valid sequences.

Many years ago, I decided for ICU to add a small amount of slow-path
error-handling code for more human-friendly illegal-sequence reporting. In
other words, this was not done out of convenience; it was an inconvenience
that seemed justified by nicer error reporting. If you don't like to do so,
then don't.

Which UTF is better? It depends. They all have advantages and problems.
It's all Unicode, so it's all good.

ICU largely uses UTF-16 but also UTF-8. It has data structures and code for
charset conversion, property lookup, sets of characters (UnicodeSet), and
collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly
growing set of APIs working directly with UTF-8.

So, please take a deep breath. No conformance requirement is being touched,
no one is forced to do something they don't like, no special consideration
is given for one UTF over another.

Best regards,
markus

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 20:01, Philippe Verdy  wrote:
> 
> On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random 
> sequences of 16-bit code units are not permitted. There's visibly a 
> validation step that returns an error if you attempt to create files with 
> invalid sequences (including other restrictions such as forbidding U+ and 
> some other problematic controls).

For it to work the way I suggested, there would be low level routines that 
handles the names raw, and then on top of that, interface routines doing what 
you describe. On the Austin Group List, they mentioned a filesystem doing it 
directly in UTF-16, and it could have been the one you describe.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode :

> C) The data was corrupted by some other means.  Perhaps bad
> concatenations, lost blocks during read/transmission, etc.  If we lost 2
> 512 byte blocks, then maybe we should have a thousand FFFDs (but how would
> we known?)
>

Thousands of U+FFFD's is not a problem (independantly of the internal UTF
encoding used): yes the 2512 byte block could then become 3 times larger
(if using UTF-8 internal encoding) or 2 times larger (if using UTF-16
internal encoding) but every application should be prepared to support the
size expansion with a completely know maximum factor, which could occur as
well with any valid CJK-only text.
So the size to allocate for the internal sorage is predictable from the
size of the input, this is an important feature of all standard UTF's.
Being able to handle the worst case of allowed expansion, militates largely
for the adoption of UTF-16 as the internal encoding, instead of UTF-8
(where you'll need to allocate more space before decoding the input, if you
want to avoid successive memory reallocations, which would impact the
performance of your decoder): it's simple to accept input from 512 bytes
(or 1KB) buffers, and allocate a 1KB (or 2KB) buffer for storing the
intermediate results in the generic decoder, and simpler on the outer level
to preallocate buffers with resonable sizes that will be reallocated once
if needed to the maximum size, and then reduced to the effective size (if
needed) at end of successful decoding (some implementations can use pools
of preallocated buffers with small static sizes, allocating new buffers out
side the pool only for rare cases where more space will be needed)
.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode

On 5/16/2017 10:30 AM, Shawn Steele via
Unicode wrote:

Would you advocate replacing

e0 80 80

with

U+FFFD U+FFFD U+FFFD (1)

rather than

U+FFFD (2)

It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t
want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t
see the logic in insisting that it must be decoded to *three* code points when it clearly only
represented one in the input.

It is not at all clear what the intent of the encoder was - or even if it's not just a problem with the data stream. E0 80 80 is not permitted, it's garbage. An encoder can't "intend" it.

Either
A) the "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, and so the # of FFFD's doesn't matter, or

B) the "encoder" is completely broken, in which case all bets are off, again, specifying the # of FFFD's is irrelevant.

C) The data was corrupted by some other means. Perhaps bad concatenations, lost blocks during read/transmission, etc. If we lost 2 512 byte blocks, then maybe we should have a thousand FFFDs (but how would we known?)

-Shawn

Clearly, for the receiver, nothing reliable
can be deduced about the raw byte stream once an FFFD has been
inserted.
For the receiver, there's a fourth case that
might have been:

D) the raw UTF-8 stream contained a valid U+FFFD

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Regardless, it's not legal and hasn't been legal for quite some time.  
Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to 
anything depending on that fake-null, so one or three isn't really going to 
matter.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +
Shawn Steele via Unicode <unicode@unicode.org> wrote:

> > Would you advocate replacing
> 
> >   e0 80 80
> 
> > with
> 
> >   U+FFFD U+FFFD U+FFFD (1)  
> 
> > rather than
> 
> >   U+FFFD   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d say, 
> > and while we certainly don’t want to decode it as a NUL (that was 
> > the source of previous security bugs, as I recall), I also don’t see 
> > the logic in insisting that it must be decoded to *three* code 
> > points when it clearly only represented one in the input.
> 
> It is not at all clear what the intent of the encoder was - or even if 
> it's not just a problem with the data stream.  E0 80 80 is not 
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 E0, which is still in 
use, and seems to be the best way of storing NUL as character content in a *C 
string*.  (Strictly speaking, one can't do it.)  It could be lurking in old 
text or come from an old program that somehow doesn't get used for U+0080 to 
U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of 
converting UTF-16 to UTF-8.

Remember the conformance test for the Unicode Collation Algorithm has contained 
lone surrogates in the past, and the UAX on Unicode Regular Expressions used to 
require the ability to search for lone surrogates.

Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random
sequences of 16-bit code units are not permitted. There's visibly a
validation step that returns an error if you attempt to create files with
invalid sequences (including other restrictions such as forbidding U+
and some other problematic controls).

This occurs because the NTFS and FAT driver will also attempt to normalize
the string in order to create compatibility 8.3 filenames using the
system's native locale (not the current user locale which is used when
searching files/enumerating directories or opening files - this could
generate errors when the encodings for distinct locales do not match, but
should not cause errors when filenames are **first** searched in their
UTF-16 encoding specified in applications, but applications that still need
to access files using their short name are deprecated). The kind of
normalization taken for creating short 8.3 filenames uses OS-specific
specific conversion tables built in the filesystem drivers. This generation
however has a cost due to the uniqueness constraints (requiring to
abbreviate the first part of the 8.3 name to add "~numbered" suffixes
before the extension, whose value is unpredicatable if there are other
existing "*~1.*" files: it requires the driver to retry with another
number, looping if necessary). This also has a (very modest) storage cost
but it is less critical than the enumeration step and the fact that these
shortened name cannot be predicted by applications.

This canonicalization is also required also because the filesystem is
case-insensitive (and it's technically not possible to store all the
multiple case variants for filenames as assigned aliases/physical links).
In classic filesystems for Unix/Linux the only restrictions are on
forbidding null bytes, and assigning "/" a role for hierarchic filesystems
(unusable anywhere as directory entry name), plus the preservation of "."
and ".." entries in directories, meaning that only 8-bit encodings based on
7-bit ASCII are possible, so Linux/Unix are not completely treating thes
filenames as pure binary bags of bytes (however if this is not checked and
such random names may occur, which will be difficult to handle with classic
tools and shells). Some other filesystems for Linux/Unix are still
enforcing restrictions (and there exists even versions of them that are
supporting case insensitity, in addition to FAT12/FAT16/FAT32/exFAT/NTFS
emulated filesystems: this also exists in NFS driver as an option, or in
drivers for legacy filesystems initially coming from mainframes, or in
filesystem drivers based on FTP, and even in the filesystem driver allowing
to mount a Windows registry which is also case-insensitive).

Technically in the core kernel of Linux/Unix there's no restriction on the
effective encoding (except "/" and null), the actual restrictions are
implemented within filesystem drivers, configured only when volumes are
mounted: each mounted filesystem can then have its own internal encoding;
there will be different behaviors when using a driver for any MacOS
filesystem.

Linux can perfectly work with NTFS filesystems, except that most of the
time, short filenames will be completely ignored and not generated on the
fly.

This generation of short filenames in a legacy (unspecified) 8-bit codepage
is not a requirement of NTFS and it can be disabled also in Windows.

But FAT12/FAT16/FAT32 still require these legacy short names to be
generated when only the LFN could be used, and the short 8.3 name left
completely null in the main directory entry ; but legacy FAT drivers will
shoke on these null entries, if they are not tagged by a custom attribute
bit as "ignorable but not empty", or if the 8+3 characters do not use
specific unique parterns such as "\" followed by 7 pseudo-random characters
in the main part, plus 3 other pseudo-random characters in the extension
(these 10 characters may use any non null value: they provide nearly 80
bits or more exactly 250^10 identifiers if we exclude the 6 characters "/",
"\", ".", ":" NULL and SPACE that are reserved, which could be generated
almost predictably simply by hashing the original unabbreviated name with
79 bits from SHA-128, or faster with simple MD5 hahsing, and very rare
remaining collisions to handle).

Some FAT repait tools will attempt to repair the legacy short filenames
that are not unique or cannot be derived from the UTF-16 encoded LFN (this
happens when "repairing" a FAT volume initially created on another system
that used a different 8-bit OEM codepage, but this "CheckDisk" tools should
have an option to not "repair" them, given that modern applications
normally do not need these filenames if a LFN is present (even the Windows
Explorer will not display these short names because trhey are hidden by
default each time there's a LFN which overrides them).

We must add however that on FAT filesystems, a LFN will not always be
stored if the Unicode name already has

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode

On Tue, 16 May 2017 17:30:01 +
Shawn Steele via Unicode  wrote:

> > Would you advocate replacing  
> 
> >   e0 80 80  
> 
> > with  
> 
> >   U+FFFD U+FFFD U+FFFD (1)  
> 
> > rather than  
> 
> >   U+FFFD   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d
> > say, and while we certainly don’t want to decode it as a NUL (that
> > was the source of previous security bugs, as I recall), I also
> > don’t see the logic in insisting that it must be decoded to *three*
> > code points when it clearly only represented one in the input.  
> 
> It is not at all clear what the intent of the encoder was - or even
> if it's not just a problem with the data stream.  E0 80 80 is not
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 E0, which is
still in use, and seems to be the best way of storing NUL as character
content in a *C string*.  (Strictly speaking, one can't do it.)  It
could be lurking in old text or come from an old program that somehow
doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2
to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8.

Remember the conformance test for the Unicode Collation Algorithm has
contained lone surrogates in the past, and the UAX on Unicode Regular
Expressions used to require the ability to search for lone surrogates.

Richard.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8