Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 19:24:04 +
Shawn Steele via Unicode  wrote:

> It seems to me that if a data stream of ambiguous quality is to be used
> in another application with predictable results, then that
> stream should be “repaired” prior to being handed over.  Then both
> endpoints would be using the same set of FFFDs, whether that was
> single or multiple forms.

This of course depends on where the damage is being done.  You're
urging that applications check the strings they have generated as they
export them.

Richard.




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> And *that* is what the specification says.  The whole problem here is that 
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of 
> us don’t think *should*
> be considered best practice.

> Perhaps “best practice” should simply be altered to say that you *clearly 
> document* your behavior
> in the case of invalid UTF-8 sequences, and that code should not rely on the 
> number of U+FFFDs 
> generated, rather than suggesting a behaviour?

That's what I've been suggesting.

I think we could maybe go a little further though:

* Best practice is clearly not to depend on the # of U+FFFDs generated by 
another component/app.  Clearly that can't be relied upon, so I think everyone 
can agree with that.
* I think encouraging documentation of behavior is cool, though there are 
probably low priority bugs and people don't like to read the docs in that 
detail, so I wouldn't expect very much from that.
* As far as I can tell, there are two (maybe three) sane approaches to this 
problem (a sketch of the first follows this list):
    * Either a "maximal" emission of one U+FFFD for every byte that exists 
      outside of a good sequence
    * Or a "minimal" version that presumes the lead byte was counting trail 
      bytes correctly even if the resulting sequence was invalid.  In that 
      case just use one U+FFFD.
    * And (maybe, I haven't heard folks arguing for this one) emit one 
      U+FFFD at the first garbage byte and then ignore the input until valid 
      data starts showing up again.  (So you could have 1 U+FFFD for a string 
      of a hundred garbage bytes as long as there weren't any valid sequences 
      within that group).
* I'd be happy if the best practice encouraged one of those two (or maybe 
three) approaches.  I think an approach that called rand() to see how many 
U+FFFDs to emit when it encountered bad data is fair to discourage.
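
For concreteness, here is a rough sketch of the "maximal" option above
(hypothetical Python, mine rather than code from any implementation in this
thread; the "good sequence" test follows the UTF-8 well-formedness table):

    def well_formed_length(data, i):
        """Length (1-4) of the well-formed UTF-8 sequence starting at
        data[i], or 0 if no well-formed sequence starts there."""
        b = data[i]
        if b <= 0x7F:
            return 1
        if 0xC2 <= b <= 0xDF:
            need, first = 1, (0x80, 0xBF)
        elif b == 0xE0:
            need, first = 2, (0xA0, 0xBF)   # excludes over-long 3-byte forms
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, first = 2, (0x80, 0xBF)
        elif b == 0xED:
            need, first = 2, (0x80, 0x9F)   # excludes surrogates
        elif b == 0xF0:
            need, first = 3, (0x90, 0xBF)   # excludes over-long 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, first = 3, (0x80, 0xBF)
        elif b == 0xF4:
            need, first = 3, (0x80, 0x8F)   # caps the range at U+10FFFF
        else:
            return 0     # C0, C1, F5..FF, or a stray trail byte
        if i + need >= len(data):
            return 0     # sequence runs off the end of the input
        if not first[0] <= data[i + 1] <= first[1]:
            return 0
        for k in range(i + 2, i + need + 1):
            if not 0x80 <= data[k] <= 0xBF:
                return 0
        return need + 1

    def count_fffd_maximal(data):
        """One U+FFFD per byte that is not part of a good sequence."""
        i = bad = 0
        while i < len(data):
            n = well_formed_length(data, i)
            if n:
                i += n
            else:
                bad += 1
                i += 1
        return bad

    assert count_fffd_maximal("héllo".encode()) == 0
    assert count_fffd_maximal(b"\xC0\x80abc") == 2     # "minimal" would give 1
    assert count_fffd_maximal(b"a" + b"\x80" * 6 + b"b") == 6

The "minimal" version would differ only in gathering however many trail bytes
the lead byte announces and then charging a single U+FFFD for the whole
gathered run if it turns out invalid; the third option would additionally
swallow any following garbage bytes into that same U+FFFD.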

-Shawn



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Doug Ewell via Unicode
Henri Sivonen wrote:

> If anything, I hope this thread results in the establishment of a
> requirement for proposals to come with proper research about what
> multiple prominent implementations do about the subject matter of a
> proposal concerning changes to text about implementation behavior.

Considering that several folks have objected that the U+FFFD
recommendation is perceived as having the weight of a requirement, I
think adding Henri's good advice above as a "requirement" seems
heavy-handed. Who will judge how much research qualifies as "proper"?
Who will determine that the judge doesn't have a conflict?

An alternative would be to require that proposals, once received with
whatever amount of research, are augmented with any necessary additional
research *before* being approved. The identity or reputation of the
requester should be irrelevant to approval.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> it’s more meaningful for whoever sees the output to see a single U+FFFD 
> representing 
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid 
> lead byte and 
> then another for an “unexpected” trailing byte.

I disagree.  It may be more meaningful for some applications to have a single 
U+FFFD representing an illegally encoded 2-byte NULL than to have 2 U+FFFDs.  
Of course then you don't know if it was an illegally encoded 2-byte NULL or an 
illegally encoded 3-byte NULL or whatever, so some information that other 
applications may be interested in is lost.

Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the 
byte, and try again" approach.  

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> For implementations that emit FFFD while handling text conversion and repair 
> (ie, converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same 
> results, so that indices within the
> resulting strings are consistent across implementations for all the correct 
> characters thereafter.

That seems optimistic :)

If interoperability is the goal, then it would seem to me that changing the 
recommendation would be contrary to that goal.  There are systems that will not 
or cannot change to a new recommendation.  If such systems are updated, then 
adoption of those systems will likely take some time.

In other words, I cannot see where “consistency across implementations” would 
be achievable anytime in the near future.

It seems to me that if a data stream of ambiguous quality is to be used in 
another application with predictable results, then that stream should be 
“repaired” prior to being handed over.  Then both endpoints would be using the 
same set of FFFDs, whether that was single or multiple forms.


-Shawn


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Mark Davis ☕️ via Unicode
> I do not understand the energy being invested in a case that shouldn't
happen, especially in a case that is a subset of all the other bad cases
that could happen.

I think Richard stated the most compelling reason:

… The bug you mentioned arose from two different ways of
counting the string length in 'characters'.  Having two different
'character' counts for the same string is inviting trouble.


For implementations that emit FFFD while handling text conversion and
repair (ie, converting ill-formed UTF-8 to well-formed), it is best for
interoperability if they get the same results, so that indices within the
resulting strings are consistent across implementations for all the
*correct* characters thereafter.

It would be preferable *not* to have the following:

source = %c0%80abc

Vendor 1:
fixed = fix(source)
fixed == ��abc
codepointAt(fixed, 3) == 'b'

Vendor 2:
fixed = fix(source)
fixed == �abc
codepointAt(fixed, 3) == 'c'

In theory one could just throw an exception. In practice, nobody wants
their browser to belly up on a webpage with a component that has an
ill-formed bit of UTF-8.

In theory one could document everyone's flavor of the month for how many
FFFD's to emit. In practice, that falls apart immediately, since in today's
interconnected world you can't tell which processes get first crack at text
repair.

Mark

On Wed, May 31, 2017 at 7:43 PM, Shawn Steele via Unicode <
unicode@unicode.org> wrote:

> > > In either case, the bad characters are garbage, so neither approach is
> > > "better" - except that one or the other may be more conducive to the
> > > requirements of the particular API/application.
>
> > There's a potential issue with input methods that indirectly edit the
> backing store.  For example,
> > GTK input methods (e.g. function gtk_im_context_delete_surrounding())
> can delete an amount
> > of text specified in characters, not storage units.  (Deletion by
> storage units is not available in this
> > interface.)  This might cause utter confusion or worse if the backing
> store starts out corrupt.
> > A corrupt backing store is normally manually correctable if most of the
> text is ASCII.
>
> I think that's sort of what I said: some approaches might work better for
> some systems and another approach might work better for another system.
> This also presupposes a corrupt store.
>
> It is unclear to me what the expected behavior would be for this
> corruption if, for example, there were merely a half dozen 0x80 in the
> middle of ASCII text?  Is that garbage a single "character"?  Perhaps
> because it's a consecutive string of bad bytes?  Or should it be 6
> characters since they're nonsense?  Or maybe 2 characters because the
> maximum # of trail bytes we can have is 3?
>
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?
>
> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.
>
> There also appears to be a special weight given to non-minimally-encoded
> sequences.  It would seem to me that none of these illegal sequences should
> appear in practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
> sequences, causing garbage (perhaps one of the above 2 coding errors).
>
> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real".  And that shouldn't happen (it's a bad encoder after
> all).  The other scenarios seem just as likely, (or, IMO, much more likely)
> than a badly designed encoder creating overlong sequences that appear to
> fit the UTF-8 pattern but aren't actually UTF-8.
>
> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.
>
> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases
> that could happen.
>
> -Shawn
>
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
On 31 May 2017, at 18:43, Shawn Steele via Unicode  wrote:
> 
> It is unclear to me what the expected behavior would be for this corruption 
> if, for example, there were merely a half dozen 0x80 in the middle of ASCII 
> text?  Is that garbage a single "character"?  Perhaps because it's a 
> consecutive string of bad bytes?  Or should it be 6 characters since they're 
> nonsense?  Or maybe 2 characters because the maximum # of trail bytes we can 
> have is 3?

It should be six U+FFFD characters, because 0x80 is not a lead byte.  
Basically, the new proposal is that we should decode bytes that structurally 
match UTF-8, and if the encoding is then illegal (because it’s over-long, 
because it’s a surrogate or because it’s over U+10) then the entire thing 
is replaced with U+FFFD.  If, on the other hand, we get a sequence that isn’t 
structurally valid UTF-8, we replace the maximally *structurally* valid subpart 
with U+FFFD and continue.
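
For what it's worth, here is a rough sketch of that behaviour as I read it
(hypothetical Python, mine rather than Alastair's or any shipping decoder):
assemble whatever structurally matches the lead byte's bit pattern, then emit
a single U+FFFD if the assembled value is over-long, a surrogate, above
U+10FFFF, or if the trail bytes ran out early.

    REPLACEMENT = "\uFFFD"
    MIN_BY_LEN = {2: 0x80, 3: 0x800, 4: 0x10000}   # smallest value needing n bytes

    def expected_trails(lead):
        """Trail-byte count implied by the lead byte's bit pattern alone."""
        if lead >> 5 == 0b110:
            return 1
        if lead >> 4 == 0b1110:
            return 2
        if lead >> 3 == 0b11110:
            return 3
        return None    # 0x80..0xBF or 0xF8..0xFF cannot start a sequence

    def decode_lenient(data):
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                       # ASCII passes through
                out.append(chr(b)); i += 1; continue
            n = expected_trails(b)
            if n is None:                      # stray byte: one U+FFFD
                out.append(REPLACEMENT); i += 1; continue
            j = i + 1                          # gather up to n trail bytes
            while j <= i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
                j += 1
            if j - i == n + 1:                 # structurally complete: validate
                cp = b & (0x7F >> (n + 1))
                for t in data[i + 1:j]:
                    cp = (cp << 6) | (t & 0x3F)
                ok = (cp >= MIN_BY_LEN[n + 1]
                      and not 0xD800 <= cp <= 0xDFFF
                      and cp <= 0x10FFFF)
                out.append(chr(cp) if ok else REPLACEMENT)
            else:                              # truncated: one U+FFFD for the prefix
                out.append(REPLACEMENT)
            i = j
        return "".join(out)

    # The cases discussed in this message:
    assert decode_lenient("héllo".encode()) == "héllo"
    assert decode_lenient(b"a" + b"\x80" * 6 + b"b") == "a" + "\uFFFD" * 6 + "b"
    assert decode_lenient(b"\xC2\xC2") == "\uFFFD" * 2      # two lead bytes, no trails
    assert decode_lenient(b"\xC0\x80") == "\uFFFD"          # over-long NUL
    assert decode_lenient(b"\xF4\x90\x80\x80") == "\uFFFD"  # above U+10FFFF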

> What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?

Then you get two U+FFFDs.

> I can see how different implementations might be able to come up with "rules" 
> that would help them navigate (or clean up) those minefields, however it is 
> not at all clear to me that there is a "best practice" for those situations.

I’m not sure the whole “best practice” thing has been a lot of help here.  
Perhaps we should change it to say “Suggested Handling”, to make quite clear 
that filing a bug report against code that chooses some other option is not 
necessary?

> There also appears to be a special weight given to non-minimally-encoded 
> sequences.

I don’t think that’s true, *although* it *is* true that UTF-8 decoders 
historically tended to allow such things, so one might assume that some 
software out there is generating them for whatever reason.

There are also *deliberate* violations of the minimal length encoding 
specification in some cases (for instance to allow NUL to be encoded in such a 
way that it won’t terminate a C-style string).  Yes, you may retort, that isn’t 
“valid UTF-8”.  Sure.  It *is* useful, though, and it is *in use*.  If a UTF-8 
decoder encounters such a thing, it’s more meaningful for whoever sees the 
output to see a single U+FFFD representing the illegally encoded NUL than it is 
to see two U+FFFDs, one for an invalid lead byte and then another for an 
“unexpected” trailing byte.
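
For concreteness (my own illustration, not from the message above): the
deliberate over-long form in question is the two-byte sequence C0 80, whose
payload bits reassemble to scalar value zero, i.e. NUL encoded without a
0x00 byte:

    lead, trail = 0xC0, 0x80
    assert ((lead & 0x1F) << 6) | (trail & 0x3F) == 0x00   # over-long NUL

Under the lead-byte-driven reading that is one U+FFFD; under the Unicode 9.0
maximal-subpart reading it is two, since C0 cannot begin any well-formed
sequence.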

Likewise, there are encoders that generate surrogates in UTF-8, which is, of 
course, illegal, but *does* happen.  Again, they can provide reasonable 
justifications for their behaviour (typically they want the default binary sort 
to work the same as for UTF-16 for some reason), and again, replacing a single 
surrogate with U+FFFD rather than multiple U+FFFDs is more helpful to 
whoever/whatever ends up seeing it.
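
By way of illustration (my own example, with a hypothetical helper): a
CESU-8-style encoder represents U+10000 as the surrogate pair D800/DC00, each
packed into a three-byte UTF-8-shaped sequence:

    def encode_surrogate_cesu8(s):
        """Three-byte UTF-8-style packing of a 16-bit surrogate code unit."""
        return bytes([0xE0 | (s >> 12),
                      0x80 | ((s >> 6) & 0x3F),
                      0x80 | (s & 0x3F)])

    cesu8 = encode_surrogate_cesu8(0xD800) + encode_surrogate_cesu8(0xDC00)
    assert cesu8 == bytes.fromhex("EDA080EDB080")     # vs. real UTF-8: F0 90 80 80

A decoder that assembles by lead byte and then validates would emit one U+FFFD
per three-byte surrogate here (two in total); under the Unicode 9.0
maximal-subpart wording, ED A0 is not a prefix of any well-formed sequence, so
such input typically yields three U+FFFDs per surrogate (six in total), which
is the difference being described.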

And, of course, there are encoders that are attempting to exploit security 
flaws, which will very definitely generate these kinds of things.

>  It would seem to me that none of these illegal sequences should appear in 
> practice, so we have either:
> 
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of sequences, 
> causing garbage (perhaps one of the above 2 coding errors).

I see no reason to suppose that the proposed behaviour would function any less 
well in those cases.

> Only in the first case, of a bad encoder, are the overlong sequences actually 
> "real".  And that shouldn't happen (it's a bad encoder after all).

Except some encoders *deliberately* use over-longs, and one would assume that 
since UTF-8 decoders historically allowed this, there will be data “in the 
wild” that has this form.

> The other scenarios seem just as likely, (or, IMO, much more likely) than a 
> badly designed encoder creating overlong sequences that appear to fit the 
> UTF-8 pattern but aren't actually UTF-8.

I’m not sure I agree that flipped bits, lost bytes and extra bytes are more 
likely than a “bad” encoder.  Bad string manipulation is of course prevalent, 
though - there’s no way around that.

> The other cases are going to cause byte patterns that are less "obvious" 
> about how they should be navigated for various applications.

This is true, *however* the new proposed behaviour is in no way inferior to the 
old proposed behaviour in those cases - it’s just different.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode

> On 30 May 2017, at 18:11, Shawn Steele via Unicode  
> wrote:
> 
>> Which is to completely reverse the current recommendation in Unicode 9.0. 
>> While I agree that this might help you fending off a bug report, it would 
>> create chances for bug reports for Ruby, Python3, many if not all Web 
>> browsers,...
> 
> & Windows & .Net
> 
> Changing the behavior of the Windows / .Net SDK is a non-starter.
> 
>> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows 
>> what it means, but everybody knows they don't exist.
> 
> Yes, this is trying to improve the language for a scenario that CANNOT 
> HAPPEN.  We're trying to optimize a case for data that implementations should 
> never encounter.  It is sort of exactly like optimizing for the case where 
> your data input is actually a dragon and not UTF-8 text.  
> 
> Since it is illegal, then the "at least 1 FFFD but as many as you want to 
> emit (or just fail)" is fine.

And *that* is what the specification says.  The whole problem here is that 
someone elevated one choice to the status of “best practice”, and it’s a choice 
that some of us don’t think *should* be considered best practice.

Perhaps “best practice” should simply be altered to say that you *clearly 
document* your behaviour in the case of invalid UTF-8 sequences, and that code 
should not rely on the number of U+FFFDs generated, rather than suggesting a 
behaviour?

Kind regards,

Alastair.

--
http://alastairs-place.net




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> > In either case, the bad characters are garbage, so neither approach is 
> > "better" - except that one or the other may be more conducive to the 
> > requirements of the particular API/application.

> There's a potential issue with input methods that indirectly edit the backing 
> store.  For example,
> GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can 
> delete an amount 
> of text specified in characters, not storage units.  (Deletion by storage 
> units is not available in this
> interface.)  This might cause utter confusion or worse if the backing store 
> starts out corrupt. 
> A corrupt backing store is normally manually correctable if most of the text 
> is ASCII.

I think that's sort of what I said: some approaches might work better for some 
systems and another approach might work better for another system.  This also 
presupposes a corrupt store.

It is unclear to me what the expected behavior would be for this corruption if, 
for example, there were merely a half dozen 0x80 in the middle of ASCII text?  
Is that garbage a single "character"?  Perhaps because it's a consecutive 
string of bad bytes?  Or should it be 6 characters since they're nonsense?  Or 
maybe 2 characters because the maximum # of trail bytes we can have is 3?

What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?

I can see how different implementations might be able to come up with "rules" 
that would help them navigate (or clean up) those minefields, however it is not 
at all clear to me that there is a "best practice" for those situations.

There also appears to be a special weight given to non-minimally-encoded 
sequences.  It would seem to me that none of these illegal sequences should 
appear in practice, so we have either:

* A bad encoder spewing out garbage (overlong sequences)
* Flipped bit(s) due to storage/transmission/whatever errors
* Lost byte(s) due to storage/transmission/coding/whatever errors
* Extra byte(s) due to whatever errors
* Bad string manipulation breaking/concatenating in the middle of sequences, 
causing garbage (perhaps one of the above 2 coding errors). 

Only in the first case, of a bad encoder, are the overlong sequences actually 
"real".  And that shouldn't happen (it's a bad encoder after all).  The other 
scenarios seem just as likely, (or, IMO, much more likely) than a badly 
designed encoder creating overlong sequences that appear to fit the UTF-8 
pattern but aren't actually UTF-8.

The other cases are going to cause byte patterns that are less "obvious" about 
how they should be navigated for various applications.

I do not understand the energy being invested in a case that shouldn't happen, 
especially in a case that is a subset of all the other bad cases that could 
happen.

-Shawn 



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 15:12:12 +0300
Henri Sivonen via Unicode  wrote:

> The write-up mentions
> https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
> like to draw everyone's attention to that bug, which is real-world
> evidence of a bug arising from two UTF-8 decoders within one product
> handling UTF-8 errors differently.

> Does it matter if a proposal/appeal is submitted as a non-member
> implementor person, as an individual person member or as a liaison
> member? http://www.unicode.org/consortium/liaison-members.html lists
> "the Mozilla Project" as a liaison member, but Mozilla-side
> conventions make submitting proposals like this "as Mozilla"
> problematic (we tend to avoid "as Mozilla" statements on technical
> standardization fora except when the W3C Process forces us to make
> them as part of charter or Proposed Recommendation review).

There may well be an advantage to being able to answer any questions on
the proposal at the meeting, especially if it isn't read until the
meeting.

> > The modified text is a set of guidelines, not requirements. So no
> > conformance clause is being changed.  
> 
> I'm aware of this.
> 
> > If people really believed that the guidelines in that section
> > should have been conformance clauses, they should have proposed
> > that at some point.  
> 
> It seems to me that this thread does not support the conclusion that
> the Unicode Standard's expression of preference for the number of
> REPLACEMENT CHARACTERs should be made into a conformance requirement
> in the Unicode Standard. This thread could be taken to support a
> conclusion that the Unicode Standard should not express any preference
> beyond "at least one and at most as many as there were bytes".
> 
> On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
>  wrote:
> >  In any case, Henri is complaining that it’s too difficult to
> > implement; it isn’t.  You need two extra states, both of which are
> > trivial.  
> 
> I am not claiming it's too difficult to implement. I think it
> inappropriate to ask implementations, even from-scratch ones, to take
> on added complexity in error handling on mere aesthetic grounds. Also,
> I think it's inappropriate to induce implementations already written
> according to the previous guidance to change (and risk bugs) or to
> make the developers who followed the previous guidance with precision
> be the ones who need to explain why they aren't following the new
> guidance.

How straightforward is the FSM for back-stepping?

> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
>  wrote:
> > The UTF-8 conversion code that I wrote for ICU, and apparently the
> > code that various other people have written, collects sequences
> > starting from lead bytes, according to the original spec, and at
> > the end looks at whether the assembled code point is too low for
> > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a
> > non-trail byte is quite natural, and reading the PRI text
> > accordingly is quite natural too.  
> 
> I don't doubt that other people have written code with the same
> concept as ICU, but as far as non-shortest form handling goes in the
> implementations I tested (see URL at the start of this email) ICU is
> the lone outlier.

You should have researched implementations as they were in 2007.

My own code uses the same concept as Markus's ICU code - convert and
check the resulting value is legal for the length.  As a check,
remember that for n > 1, n bytes could represent 2**(5n + 1) values if
overlongs were permitted.
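
A quick check of that figure (my arithmetic, not from the thread): an n-byte
pattern carries (7 - n) payload bits in the lead byte plus 6 in each of the
(n - 1) trail bytes.

    for n in (2, 3, 4):
        assert (7 - n) + 6 * (n - 1) == 5 * n + 1
    # 2 bytes: 2**11 = 2048 values; 3 bytes: 2**16 = 65536; 4 bytes: 2**21 = 2097152.
    # Far more than each length legitimately encodes, hence the range check
    # against the sequence length to reject over-longs.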

> > Aside from UTF-8 history, there is a reason for preferring a more
> > "structural" definition for UTF-8 over one purely along valid
> > sequences. This applies to code that *works* on UTF-8 strings
> > rather than just converting them. For UTF-8 *processing* you need
> > to be able to iterate both forward and backward, and sometimes you
> > need not collect code points while skipping over n units in either
> > direction -- but your iteration needs to be consistent in all
> > cases. This is easier to implement (especially in fast, short,
> > inline code) if you have to look only at how many trail bytes
> > follow a lead byte, without having to look whether the first trail
> > byte is in a certain range for some specific lead bytes.  
> 
> But the matter at hand is decoding potentially-invalid UTF-8 input
> into a valid in-memory Unicode representation, so later processing is
> somewhat a red herring as being out of scope for this step.

No.  Both lossily converting a UTF-8-like string as a stream of bytes to
scalar values and moving back and forth through the string 'character'
by 'character' imply an ability to count the number of 'characters' in
the string.  The bug you mentioned arose from two different ways of
counting the string length in 'characters'.  Having two different
'character' counts for the same string is inviting trouble. 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode
I've researched this more. While the old advice dominates the handling
of non-shortest forms, there is more variation than I previously
thought when it comes to truncated sequences and CESU-8-style
surrogates. Still, the ICU behavior is an outlier considering the set
of implementations that I tested.

I've written up my findings at https://hsivonen.fi/broken-utf-8/

The write-up mentions
https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
like to draw everyone's attention to that bug, which is real-world
evidence of a bug arising from two UTF-8 decoders within one product
handling UTF-8 errors differently.

On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode
 wrote:
> There is plenty of time for public comment, since it was targeted at Unicode
> 11, the release for about a year from now, not Unicode 10, due this year.
> When the UTC "approves a change", that change is subject to comment, and the
> UTC can always reverse or modify its approval up until the meeting before
> release date. So there are ca. 9 months in which to comment.

What should I read to learn how to formulate an appeal correctly?

Does it matter if a proposal/appeal is submitted as a non-member
implementor person, as an individual person member or as a liaison
member? http://www.unicode.org/consortium/liaison-members.html lists
"the Mozilla Project" as a liaison member, but Mozilla-side
conventions make submitting proposals like this "as Mozilla"
problematic (we tend to avoid "as Mozilla" statements on technical
standardization fora except when the W3C Process forces us to make
them as part of charter or Proposed Recommendation review).

> The modified text is a set of guidelines, not requirements. So no
> conformance clause is being changed.

I'm aware of this.

> If people really believed that the guidelines in that section should have
> been conformance clauses, they should have proposed that at some point.

It seems to me that this thread does not support the conclusion that
the Unicode Standard's expression of preference for the number of
REPLACEMENT CHARACTERs should be made into a conformance requirement
in the Unicode Standard. This thread could be taken to support a
conclusion that the Unicode Standard should not express any preference
beyond "at least one and at most as many as there were bytes".

On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
 wrote:
>  In any case, Henri is complaining that it’s too difficult to implement; it 
> isn’t.  You need two extra states, both of which are trivial.

I am not claiming it's too difficult to implement. I think it
inappropriate to ask implementations, even from-scratch ones, to take
on added complexity in error handling on mere aesthetic grounds. Also,
I think it's inappropriate to induce implementations already written
according to the previous guidance to change (and risk bugs) or to
make the developers who followed the previous guidance with precision
be the ones who need to explain why they aren't following the new
guidance.

On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
 wrote:
> The UTF-8 conversion code that I wrote for ICU, and apparently the code that
> various other people have written, collects sequences starting from lead
> bytes, according to the original spec, and at the end looks at whether the
> assembled code point is too low for the lead byte, or is a surrogate, or is
> above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the
> PRI text accordingly is quite natural too.

I don't doubt that other people have written code with the same
concept as ICU, but as far as non-shortest form handling goes in the
implementations I tested (see URL at the start of this email) ICU is
the lone outlier.

> Aside from UTF-8 history, there is a reason for preferring a more
> "structural" definition for UTF-8 over one purely along valid sequences.
> This applies to code that *works* on UTF-8 strings rather than just
> converting them. For UTF-8 *processing* you need to be able to iterate both
> forward and backward, and sometimes you need not collect code points while
> skipping over n units in either direction -- but your iteration needs to be
> consistent in all cases. This is easier to implement (especially in fast,
> short, inline code) if you have to look only at how many trail bytes follow
> a lead byte, without having to look whether the first trail byte is in a
> certain range for some specific lead bytes.

But the matter at hand is decoding potentially-invalid UTF-8 input
into a valid in-memory Unicode representation, so later processing is
somewhat a red herring as being out of scope for this step. I do agree
that if you already know that the data is valid UTF-8, it makes sense
to work from the bit pattern definition only. (E.g. in encoding_rs,
the implementation I've written and that's on track to replacing uconv
in Firefox, UTF-8 decode works 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Fri, 26 May 2017 21:41:49 +
Shawn Steele via Unicode  wrote:

> I totally get the forward/backward scanning in sync without decoding
> reasoning for some implementations, however I do not think that the
> practices that benefit those should extend to other applications that
> are happy with a different practice.

> In either case, the bad characters are garbage, so neither approach
> is "better" - except that one or the other may be more conducive to
> the requirements of the particular API/application.

There's a potential issue with input methods that indirectly edit the
backing store.  For example, GTK input methods (e.g. function
gtk_im_context_delete_surrounding()) can delete an amount of text
specified in characters, not storage units.  (Deletion by storage
units is not available in this interface.)  This might cause utter
confusion or worse if the backing store starts out corrupt.  A corrupt
backing store is normally manually correctable if most of the text is
ASCII.

Richard.