RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Shawn Steele via Unicode
IMO, encodings, particularly ones depending on state such as this, may have 
multiple ways to output the same, or similar, sequences.  Which means that 
pretty much any time an encoding transforms data, any previous security or other 
validation-style checks are no longer valid, and security/validation must be 
checked again.  I've seen numerous mistakes due to people expecting 
encodings to play nicely, particularly if there are different endpoints that 
may use different implementations with slightly different behaviors.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Henri Sivonen via 
Unicode
Sent: Sunday, August 16, 2020 11:39 PM
To: Mark Davis ☕️ 
Cc: Unicode Public 
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP 
escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there 
>> is no content between two ISO-2022-JP escape sequences from the 
>> WHATWG Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in 
>> that case is not a useful security measure when unnecessary 
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal 
>> > input byte sequence. Instead, it must stop with an error or 
>> > substitute a replacement character (such as U+FFFD ( � ) 
>> > REPLACEMENT CHARACTER) or an escape sequence in the output. (See 
>> > also Section 3.5 Deletion of Code Points.) It is important to do 
>> > this not only for byte sequences that encode characters, but also for 
>> > unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next 
>> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 
>> > variants require at least one character in a text segment between 
>> > shift sequences. Security software written to the formal 
>> > specification may not detect malicious text  (for example, "delete" 
>> > with a shift-to-double-byte then an immediate shift-to-ASCII in the 
>> > middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement in by means of the 
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into 
>> > its ISO-2022-JP decoder algorithm 
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements 
>> > the WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that 
>> > didn't implement this U+FFFD generation behavior (uconv), a bug has 
>> > been logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP 
>> > the unusual and surprising property that concatenating two 
>> > ISO-2022-JP outputs from a conforming encoder can result in a byte 
>> > sequence that is non-conforming as input to a ISO-2022-JP decoder.
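
A minimal sketch of that concatenation property, using CPython's iso2022_jp 
codec (which, like most encoders, resets to the ASCII state at the end of each 
encode call):

    # Encode two strings separately; each output is conforming on its own.
    a = "漢".encode("iso2022_jp")
    b = "字".encode("iso2022_jp")

    joined = a + b
    # The trailing ESC ( B of `a` is immediately followed by the leading
    # ESC $ B of `b`: two escape sequences with no text character between
    # them.  A decoder implementing the TR36 3.6.2 / WHATWG output-flag rule
    # emits U+FFFD for that empty segment; CPython's own decoder accepts it.
    print(joined)
    print(joined.decode("iso2022_jp"))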
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP 
>> > escape sequence is immediately followed by another ISO-2022-JP 
>> > escape sequence. Chrome and Safari do, but their implementations of 
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's 
>> > decoder implementations generally are informed by the Encoding 
>> > Standard (though the ISO-2022-JP decoder specifically might not be 
>> > yet), and I suspect that Safari's implementation (ICU) is either 
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations, 
>> > obfuscating the ASCII string "delete", could be accomplished by 
>> > alternating between the ASCII and Roman states so that every other 
>> > character is in the ASCII state and the rest of the Roman 

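A small sketch of that alternation (hand-built escape sequences; this assumes 
CPython's iso2022_jp codec, which accepts the JIS X 0201 Roman designation 
ESC ( J):

    # ESC ( B selects ASCII, ESC ( J selects JIS X 0201 Roman; the letters of
    # "delete" encode identically in both sets, so alternating states hides
    # the literal byte string without ever leaving a segment empty - i.e.
    # without triggering the U+FFFD rule quoted above.
    ASCII, ROMAN = b"\x1b(B", b"\x1b(J"

    obfuscated = b"".join(
        (ROMAN if i % 2 else ASCII) + bytes([byte])
        for i, byte in enumerate(b"delete")
    )
    print(b"delete" in obfuscated)            # False - a naive byte scanner misses it
    print(obfuscated.decode("iso2022_jp"))    # 'delete'
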
RE: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Shawn Steele via Unicode
I'm not opposed to a sub-block for "Modern Hieroglyphs"  

I confess that even though I know nothing about Hieroglyphs, I find it 
fascinating that such a thoroughly dead script might still be living in some 
way, even if it's only a little bit.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Ken Whistler via 
Unicode
Sent: Thursday, February 13, 2020 12:08 PM
To: Phake Nick 
Cc: unicode@unicode.org
Subject: Re: Egyptian Hieroglyph Man with a Laptop

You want "dubious"?!

You should see the hundreds of strange characters already encoded in the CJK 
*Unified* Ideographs blocks, as recently documented in great detail by Ken 
Lunde:

https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a laptop is 
positively orthodox!

--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
> Those characters could also be put into another block for the same 
> script similar to how dubious characters in CJK are included by 
> placing them into "CJK Compatibility Ideographs" for round trip 
> compatibility with source encoding.



RE: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Shawn Steele via Unicode
> From the point of view of Unicode, it is simpler: If the character is in use 
> or has had use, it should be included somehow.

That bar, to me, seems too low.  Many things are only used briefly or in a 
private context that doesn't really require encoding.

The hieroglyphs discussion is interesting because it presents them as living 
(in at least some sense) even though they're a historical script.  Apparently 
modern Egyptologists are coopting them for their own needs.  There are lots of 
emoji for professional fields.  In this case since hieroglyphs are pictorial, 
it seems they've blurred the lines between the script and emoji.  Given their 
field, I'd probably do the same thing.

I'm not opposed to the character if Egyptologists use it amongst themselves, 
though it does make me wonder whether it belongs in this set.  Are there other 
"modern" hieroglyphs?  (Other than the errors, etc. mentioned earlier - rather, 
glyphs that have been invented for modern use.)

-Shawn 




RE: Unicode "no-op" Character?

2019-07-03 Thread Shawn Steele via Unicode
I think you're overstating my concern :)

I meant that those things tend to be particular to a certain context and often 
aren't interesting for interchange.  A text editor might find it convenient to 
place word boundaries in the middle of something another part of the system 
thinks is a single unit to be rendered.  At the same time, a rendering engine 
might find it interesting that there's an ff together and want to mark it to be 
shown as a ligature though that text editor wouldn't be keen on that at all.

As has been said, these are private mechanisms for things that individual 
processes find interesting.  It's not useful to mark those for interchange, as 
the text editor's word-breaking marks would interfere with the graphics engine's 
glyph-breaking marks.  Not to mention the transmission buffer size marks 
originally mentioned, which could be anywhere.

The "right" thing to do here is to use an internal higher level mechanism to 
keep track of these things however the component needs.  That can even be 
interchanged with another component designed to the same principles, via 
mechanisms like the PUA.  However, those components can't expect their private 
mechanisms are useful or harmless to other processes.  

Even more complicated is that, as pointed out by others, it's pretty much 
impossible to say "these n codepoints should be ignored and have no meaning" 
because some process would try to use codepoints 1-3 for some private meaning.  
Another would use codepoint 1 for their own thing, and there'd be a conflict.  

As a thought experiment, I think it's certainly decent to ask the question 
"could such a mechanism be useful?"  It's an intriguing thought and a decent 
hypothesis that this kind of system could be privately useful to an 
application.  I also think that the conversation has pretty much proven that 
such a system is mathematically impossible.  (You can't have a "private" 
no-meaning codepoint that won't conflict with other "private" uses in a public 
space).

It might be worth noting that this kind of thing used to be fairly common in 
early computing.  Word processors would inject a "CTRL-I" token to toggle 
italics on or off.  Old printers used to use sequences to define the start of 
bold or italic or underlined or whatever sequences.  Those were private and 
pseudo-private mechanisms that were used internally &/or documented for others 
that wanted to interoperate with their systems.  (The printer folks would tell 
the word processors how to make italics happen, then other printer folks would 
use the same or similar mechanisms for compatibility - except for the dude that 
didn't get the memo and made their own scheme.)

Unicode was explicitly intended *not* to encode any of that kind of markup, 
and, instead, be "plain text," leaving other interesting metadata to other 
higher level protocols.  Whether those be word breaking, sentence parsing, 
formatting, buffer sizing or whatever.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Wednesday, July 3, 2019 4:20 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode"  wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making 
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting reason 
for separating base character and combining mark.  I was refuting that notion.  
Natural text boundaries can get very messy - some languages have word 
boundaries that can be *within* an indecomposable combining mark.

Richard.



RE: Unicode "no-op" Character?

2019-06-23 Thread Shawn Steele via Unicode
But... it's not actually discardable.  The hypothetical "packet" architecture 
(using the term architecture somewhat loosely) needed the information being 
tunneled in by this character.  If it was actually discardable, then the "noop" 
character wouldn't be required as it would be discarded.

Since the character conveys meaning to some parts of the system, then it's not 
actually a "noop" and it's not actually "discardable".  

What is actually being requested isn't a character that nobody has meaning for, 
but rather a character that has no PUBLIC meaning.  

Which leads us to the key.  The desire is for a character that has no public 
meaning, but has some sort of private meaning.  In other words it has a private 
use.  Oddly enough, there is a group of characters intended for private use, in 
the PUA ;-)

Of course if the PUA characters interfered with the processing of the string, 
they'd need to be stripped, but you're sort of already in that position by 
having a private flag in the middle of a string.
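
A small sketch of that approach (the marker U+E000 and the helper names are 
arbitrary assumptions; the only point is that the marker is private and gets 
stripped before the text leaves the component):

    MARK = "\ue000"   # arbitrary private-use code point used as an internal flag

    def add_marks(text: str, positions: list[int]) -> str:
        # Insert the private marker at the given offsets (internal use only).
        out, last = [], 0
        for pos in sorted(positions):
            out.append(text[last:pos] + MARK)
            last = pos
        out.append(text[last:])
        return "".join(out)

    def strip_marks(text: str) -> str:
        # Remove the private markers before handing the string to anyone else.
        return text.replace(MARK, "")

    marked = add_marks("delete the file", [6, 10])
    print(strip_marks(marked) == "delete the file")   # True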

-Shawn  

-Original Message-
From: Unicode  On Behalf Of Slawomir Osipiuk via 
Unicode
Sent: Saturday, June 22, 2019 6:10 PM
To: unicode@unicode.org
Cc: 'Richard Wordingham' 
Subject: RE: Unicode "no-op" Character?

That's the key to the no-op idea. The no-op character could not ever be assumed 
to survive interchange with another process. It'd be canonically equivalent to 
the absence of a character. It could be added or removed at any position by a 
Unicode-conformant process. A program could wipe all the no-ops from a string 
it has received, and insert its own for its own purposes. (In fact, it should 
wipe the old ones so as not to confuse
itself.) It's "another process's discardable junk" unless known, 
internally-only, to be meaningful at a particular stage.

While all the various (non)joiners/ignorables are interesting, none of them 
have this property.

In fact, that might be the best description: It's not just an "ignorable", it's 
a "discardable". Unicode doesn't have that, does it?

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Saturday, June 22, 2019 20:59
To: unicode@unicode.org
Cc: Shawn Steele
Subject: Re: Unicode "no-op" Character?

If they're conveying an invisible message, one would have to strip out original 
ZWNBSP/WJ/ZWSP that didn't affect line-breaking.  The weak point is that that 
assumes that line-break opportunities are well-defined.  For example, they 
aren't for SE Asian text.

Richard.




RE: Unicode "no-op" Character?

2019-06-22 Thread Shawn Steele via Unicode
Assuming you were using any of those characters as "markup", how would you know 
when they were intentionally in the string and not part of your marking system?

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Saturday, June 22, 2019 4:17 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode  wrote:

> If faced with the same problem today, I’d probably just go with U+FEFF 
> (really only need a single char, not a whole delimited substring) or a 
> different C0 control (maybe SI/LS0) and clean up the string if it 
> needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better
U+2060 WJ) and U+200B (ZWSP).  

> I still think an “idle”/“null tag”/“noop”  character would be a neat 
> addition to Unicode, but I doubt I can make a convincing enough case 
> for it.

You'd still only be able to insert it between characters, not between code 
units, unless you were using UTF-32.

Richard.




RE: Unicode "no-op" Character?

2019-06-22 Thread Shawn Steele via Unicode
+ the list.  For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk 
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the 
combining diacritic seems peculiar.  I'm having a hard time visualizing how 
that kind of markup could be interesting?

From: Unicode mailto:unicode-boun...@unicode.org>> 
On Behalf Of Slawomir Osipiuk via Unicode
Sent: Saturday, June 22, 2019 2:02 PM
To: unicode@unicode.org
Subject: RE: Unicode "no-op" Character?

I see there is no such character, which I pretty much expected after Google 
didn't help.

The original problem I had was solved long ago but the recent article about 
watermarking reminded me of it, and my question was mostly out of curiosity. 
The task wasn't, strictly speaking, about "padding", but about marking - 
injecting "flag" characters at arbitrary points in a string without affecting 
the resulting visible text. I think we ended up using ESC, which is a dumb 
choice in retrospect, though the whole approach was a bit of a hack anyway and 
the process it was for isn't being used anymore.


RE: Unicode "no-op" Character?

2019-06-21 Thread Shawn Steele via Unicode
I'm curious what you'd use it for?

From: Unicode  On Behalf Of Slawomir Osipiuk via 
Unicode
Sent: Friday, June 21, 2019 5:14 PM
To: unicode@unicode.org
Subject: Unicode "no-op" Character?

Does Unicode include a character that does nothing at all? I'm talking about 
something that can be used for padding data without affecting interpretation of 
other characters, including combining chars and ligatures. I.e. a character 
that could hypothetically be inserted between a latin E and a combining acute 
and still produce É. The historical description of U+0016 SYNCHRONOUS IDLE 
seems like pretty much exactly what I want. It only has one slight 
disadvantage: it doesn't work. All software I've tried displays it as an 
unknown character and it definitely breaks up combinations. And U+0000 NULL 
seems even worse.
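
As a concrete illustration of why nothing currently fills that role: canonical 
composition only applies when the combining mark immediately follows its base, 
so any intervening code point - even a default-ignorable one like WORD JOINER - 
changes the result.  A minimal sketch using Python's unicodedata module:

    import unicodedata

    plain  = "E\u0301"          # E + COMBINING ACUTE ACCENT
    padded = "E\u2060\u0301"    # same, with WORD JOINER in between

    print(unicodedata.normalize("NFC", plain))    # 'É' - composes to U+00C9
    print(unicodedata.normalize("NFC", padded))   # stays 3 code points; the accent is blocked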

I can imagine the answer is that this thing I'm looking for isn't a character 
at all and so should be the business of "a higher-level protocol" and not what 
Unicode was made for... but Unicode does include some odd things so I wonder if 
there is something like that regardless. Can anyone offer any suggestions?

Sławomir Osipiuk


RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed 
>> for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a 
little narrow-minded.

>> I didn’t look into these date interchanges but I suspect they won’t use any 
>> thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/

>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.

>> What are all these expected to do while localized with scripts outside 
>> Windows code pages?

(We call those “unicode-only” locales FWIW)

The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).  However, the fact that an app 
may run very poorly in Cherokee or whatever doesn’t mean that there aren’t a 
bunch of French enterprises that depend on that app for their day-to-day 
business.

In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.

However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.

Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.

Which is why I suggested an “opt-in” alt form that apps wanting “civilized” 
behavior could opt into (at least for long enough that enough badly behaved 
apps would be updated to warrant moving that to the default).

The data for locales like French tends to have been very stable for decades.  
Changes to data for major locales like that are more disruptive than to newer 
emerging markets where the data is undergoing more churn.

-Shawn



RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> Keeping these applications outdated has no other benefit than providing a 
>> handy lobbying tool against support of NNBSP.
I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).
Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data.  Either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.
This could be something a “new” app running with the latest & greatest locale 
data and trying to import the legacy data users had saved on that app.  Or 
exchanging data with an application using the system settings which are perhaps 
older.
>> Also when you need those apps, just tailor your French accordingly.
Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.
>> That should not impact all other users out there interested in a civilized 
>> layout.
I’m not sure that the choice of the word “civilized” adds value to the 
conversation.  We have pretty much zero feedback that the OS’s French 
formatting is “uncivilized” or that the NNBSP is required for correct support.
>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
>> for.
For compatibility, I’d actually much prefer that CLDR have an alt “best 
practice” field that maintained the existing U+00A0 behavior for compatibility, 
yet allowed applications wanting the newer typographic experience to opt in to 
the "best practice" alternative data.  As applications become used to the idea 
of an alternative for U+00A0, maybe that could be flip-flopped and U+00A0 put 
into a "legacy" alt form in a few years.
Normally I’m all for having the “best” data in CLDR, and there are many locales 
that have data with limited support for whatever reasons.  U+00A0 is pretty 
exceptional in my view though, developers have been hard-coding dependencies on 
that value for ½ a century without even realizing there might be other types of 
non-breaking spaces.  Sure, that’s not really the best practice, particularly 
in modern computing, but I suspect you’ll still find it taught in CS classes 
with little regard to things like NNBSP.
-Shawn



RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
I've been lurking on this thread a little.

This discussion has gone “all over the place”, however I’d like to point out 
that part of the reason NBSP has been used for thousands separators is because 
that it exists in all of those legacy codepages that were mentioned predating 
Unicode.

Whether or not NNBSP provides a better typographical experience, there are a 
lot of legacy applications, and even web services, that depend on legacy 
codepages.  NNBSP may be best for layout, but I doubt that making it work 
perfectly for thousands separators is going to be some sort of magic bullet that 
solves the problems that NBSP presents.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).
13/01/2019  09:48    15?360 AcXtrnal.dll
13/01/2019  09:46    54?784 AdaptiveCards.dll
13/01/2019  09:46    67?584 AddressParser.dll
13/01/2019  09:47    24?064 adhapi.dll
13/01/2019  09:47    97?792 adhsvc.dll
10/04/2013  08:32   154?624 AdjustCalendarDate.exe
10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb
13/01/2019  10:47   534?016 AdmTmpl.dll
13/01/2019  09:48    58?368 adprovider.dll
13/01/2019  10:47   136?704 adrclient.dll
13/01/2019  09:48   248?832 adsldp.dll
13/01/2019  09:46   251?392 adsldpc.dll
13/01/2019  09:48   101?376 adsmsext.dll
13/01/2019  09:48   350?208 adsnt.dll
13/01/2019  09:46   849?920 adtschema.dll
13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on 
them), and many legacy applications that don’t have proper UTF-8 or Unicode 
support (I know, they should be updated).  It doesn’t seem to me that changing 
French thousands separator to NNBSP solves all of the perceived problems.
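
To make the failure mode concrete, a two-line sketch (cp1252 is just one assumed 
legacy code page; any single-byte page that lacks U+202F behaves the same way):

    with_nbsp  = "15\u00a0360"   # NO-BREAK SPACE: present in the legacy code pages
    with_nnbsp = "15\u202f360"   # NARROW NO-BREAK SPACE: absent from them

    print(with_nbsp.encode("cp1252"))                     # b'15\xa0360' - round-trips
    print(with_nnbsp.encode("cp1252", errors="replace"))  # b'15?360'   - the ? in the listing above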

-Shawn

 
http://blogs.msdn.com/shawnste



RE: Why so much emoji nonsense? - Proscription

2018-02-15 Thread Shawn Steele via Unicode
Depends on your perspective I guess ;)

-Original Message-
From: Unicode <unicode-boun...@unicode.org> On Behalf Of Richard Wordingham via 
Unicode
Sent: Thursday, February 15, 2018 2:31 PM
To: unicode@unicode.org
Subject: Re: Why so much emoji nonsense? - Proscription

On Thu, 15 Feb 2018 21:38:19 +
Shawn Steele via Unicode <unicode@unicode.org> wrote:

> I realize "I'd've" isn't
> "right",

Where did that proscription come from?  Is it perhaps a perversion of the 
proscription of "I'd of"?

Richard.



RE: Why so much emoji nonsense?

2018-02-15 Thread Shawn Steele via Unicode
For voice we certainly get clues about the speaker's intent from their tone.  
That tone can change the meaning of the same written word quite a bit.  No 
video is needed for two different readings of the exact same words to carry 
wildly different meanings.

Writers have always taken liberties with the written word to convey ideas that 
aren't purely grammatically correct.  This may be most obvious in poetry, but 
it happens even in other writings.  Maybe their entire reason was so that 
future English teachers would ask us why some author chose some peculiar 
structure or whatever.

I find it odd that I write things like "I'd've thought" (AFAIK I hadn't been 
exposed to I'd've and it just spontaneously occurred, but apparently others 
(mis)use it as well).  I realize "I'd've" isn't "right", but it better conveys 
my current state of mind than spelling it out would've.  Similarly, if I find 
myself smiling internally while I'm writing, it's going to get a :)

Though I may use :), I agree that most of my use of emoji is more decorative, 
however including other emoji can also make the sentence feel more "fun".  

If I receive a  as the only response to a comment I made, that conveys 
information that I would have a difficult time putting into words.

I don't find emoji to necessarily be a "post-literate" thing.  Just a different 
way of communicating.  I have also seen them used in a "pre-literate" fashion.  
Helping people that were struggling to learn to read get past the initial 
difficulties they were having on their way to becoming more literate.

-Shawn

-Original Message-
From: Unicode  On Behalf Of James Kass via Unicode
Sent: Thursday, February 15, 2018 12:53 PM
To: Ken Whistler 
Cc: Erik Pedersen ; Unicode Public 
Subject: Re: Why so much emoji nonsense?

Ken Whistler replied to Erik Pedersen,

> Emoticons were invented, in large part, to fill another major hole in 
> written communication -- the need to convey emotional state and 
> affective attitudes towards the text.

There is no such need.  If one can't string words together which 'speak for 
themselves', there are other media.  I suspect that emoticons were invented for 
much the same reason that "typewriter art"
was invented:  because it's there, it's cute, it's clever, and it's novel.

> This is the kind of information that face-to-face communication has a 
> huge and evolutionarily deep bandwidth for, but which written 
> communication typically fails miserably at.

Does Braille include emoji?  Are there tonal emoticons available for telephone 
or voice transmission?  Does the telephone "fail miserably"
at oral communication because there's no video to transmit facial tics and hand 
gestures?  Did Pontius Pilate have a cousin named Otto?
These are rhetorical questions.

For me, the emoji are a symptom of our moving into a post-literate age.  We 
already have people in positions of power who pride themselves on their 
marginal literacy and boast about the fact that they don't read much.  Sad!



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
But those are IETF definitions.  They don’t have to mean the same thing in 
Unicode - except that people working in this field probably expect them to.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".



People reading standards tend to treat "SHOULD" and "MUST" as the same thing.

It's not that they "tend to", it's in RFC 2119:
SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.


The clear inference is that while the non-recommended practice is not 
prohibited, you better have some valid reason why you are deviating from it 
(and, reading between the lines, it would not hurt if you documented those 
reasons).



 So, when an implementation deviates, then you get bugs (as we see here).  
Given that there are very valid engineering reasons why someone might want to 
choose a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

Yes and no. ICU would be perfectly fine deviating from the existing 
recommendation and stating their engineering reasons for doing so. That would 
allow them to close their bug ("by documentation").

What's not OK is to take an existing recommendation and change it to something 
else, just to make bug reports go away for one implementations. That's like two 
sleepers fighting over a blanket that's too short. Whenever one is covered, the 
other is exposed.

If it is discovered that the existing recommendation is not based on anything 
like truly better behavior, there may be a case to change it to something 
that's equivalent to a MAY. Perhaps a list of nearly equally capable options.

(If that language is not in the standard already, a strong "an implementation 
MUST not depend on the use of a particular strategy for replacement of invalid 
code sequences", clearly ought to be added).

A./







-Shawn



-Original Message-

From: Alastair Houghton [mailto:alast...@alastairs-place.net]

Sent: Thursday, June 1, 2017 4:05 AM

To: Henri Sivonen <hsivo...@hsivonen.fi>

Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele 
<shawn.ste...@microsoft.com>

Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8



On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode 
<unicode@unicode.org> wrote:



On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
<unicode@unicode.org> wrote:

* As far as I can tell, there are two (maybe three) sane approaches to this 
problem:

   * Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence

   * Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.

   * And (maybe, I haven't heard folks arguing for this one) emit one 
U+FFFD at the first garbage byte and then ignore the input until valid data 
starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group).



I think it's not useful to come up with new rules in the abstract.



The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).



All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream,

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same thing.  
So, when an implementation deviates, then you get bugs (as we see here).  Given 
that there are very valid engineering reasons why someone might want to choose 
a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

-Shawn

-Original Message-
From: Alastair Houghton [mailto:alast...@alastairs-place.net] 
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivo...@hsivonen.fi>
Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele 
<shawn.ste...@microsoft.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
> 
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
> <unicode@unicode.org> wrote:
>> * As far as I can tell, there are two (maybe three) sane approaches to this 
>> problem:
>>* Either a "maximal" emission of one U+FFFD for every byte that 
>> exists outside of a good sequence
>>* Or a "minimal" version that presumes the lead byte was counting 
>> trail bytes correctly even if the resulting sequence was invalid.  In that 
>> case just use one U+FFFD.
>>* And (maybe, I haven't heard folks arguing for this one) emit one 
>> U+FFFD at the first garbage byte and then ignore the input until valid data 
>> starts showing up again.  (So you could have 1 U+FFFD for a string of a 
>> hundred garbage bytes as long as there weren't any valid sequences within 
>> that group).
> 
> I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream, I might well plump 
for the third.  I can even *imagine* there being circumstances under which I 
might choose the first for some reason, in spite of my preference for the 
second approach.

I don’t think it makes sense to standardise on *one* of these approaches, so if 
what you’re saying is that the “Best Practice” has been treated as if it was 
part of the specification (and I think that *is* essentially your claim), then 
I’m in favour of either removing it completely, or (better) replacing it with 
Shawn’s suggestion - i.e. listing three reasonable approaches and telling 
developers to document which they take and why.

Kind regards,

Alastair.

--
http://alastairs-place.net




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> And *that* is what the specification says.  The whole problem here is that 
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of 
> us don’t think *should*
> be considered best practice.

> Perhaps “best practice” should simply be altered to say that you *clearly 
> document* your behavior
> in the case of invalid UTF-8 sequences, and that code should not rely on the 
> number of U+FFFDs 
> generated, rather than suggesting a behaviour?

That's what I've been suggesting.

I think we could maybe go a little further though:

* Best practice is clearly not to depend on the # of U+FFFDs generated by 
another component/app.  Clearly that can't be relied upon, so I think everyone 
can agree with that.
* I think encouraging documentation of behavior is cool, though there are 
probably low priority bugs and people don't like to read the docs in that 
detail, so I wouldn't expect very much from that.
* As far as I can tell, there are two (maybe three) sane approaches to this 
problem:
* Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence 
* Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.
* And (maybe, I haven't heard folks arguing for this one) emit one 
U+FFFD at the first garbage byte and then ignore the input until valid data 
starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group).
* I'd be happy if the best practice encouraged one of those two (or maybe 
three) approaches.  I think an approach that called rand() to see how many 
U+FFFDs to emit when it encountered bad data is fair to discourage.
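
For concreteness, a small sketch of how the first two options differ on one 
overlong sequence.  CPython's built-in decoder is assumed here; it follows the 
current maximal-subpart recommendation, which for this input coincides with the 
per-byte count:

    bad = b"\xe0\x80\x80"   # overlong encoding of NUL - ill-formed UTF-8

    # "Maximal": no valid sequence starts with E0 80, so each byte is its own
    # error and three U+FFFDs come out.
    print(bad.decode("utf-8", errors="replace"))   # '\ufffd\ufffd\ufffd'

    # "Minimal": trust the lead byte's declared length (E0 => 3 bytes) and emit
    # a single U+FFFD for the whole sequence - written out by hand here.
    print("\ufffd")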

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> it’s more meaningful for whoever sees the output to see a single U+FFFD 
> representing 
> the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid 
> lead byte and 
> then another for an “unexpected” trailing byte.

I disagree.  It may be more meaningful for some applications to have a single 
U+FFFD representing an illegally encoded 2-byte NULL than to have 2 U+FFFDs.  
Of course then you don't know if it was an illegally encoded 2-byte NULL or an 
illegally encoded 3-byte NULL or whatever, so some information that other 
applications may be interested in is lost.

Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the 
byte, and try again" approach.  

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> For implementations that emit FFFD while handling text conversion and repair 
> (ie, converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same 
> results, so that indices within the
> resulting strings are consistent across implementations for all the correct 
> characters thereafter.

That seems optimistic :)

If interoperability is the goal, then it would seem to me that changing the 
recommendation would be contrary to that goal.  There are systems that will not 
or cannot change to a new recommendation.  If such systems are updated, then 
adoption of those systems will likely take some time.

In other words, I cannot see where “consistency across implementations” would 
be achievable anytime in the near future.

It seems to me that if the goal is to be able to use a data stream of ambiguous 
quality in another application with predictable results, then that stream should 
be "repaired" prior to being handed over.  Then both endpoints would be using the 
same set of FFFDs, whether that was single or multiple forms.


-Shawn


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> > In either case, the bad characters are garbage, so neither approach is 
> > "better" - except that one or the other may be more conducive to the 
> > requirements of the particular API/application.

> There's a potential issue with input methods that indirectly edit the backing 
> store.  For example,
> GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can 
> delete an amount 
> of text specified in characters, not storage units.  (Deletion by storage 
> units is not available in this
> interface.)  This might cause utter confusion or worse if the backing store 
> starts out corrupt. 
> A corrupt backing store is normally manually correctable if most of the text 
> is ASCII.

I think that's sort of what I said: some approaches might work better for some 
systems and another approach might work better for another system.  This also 
presupposes a corrupt store.

It is unclear to me what the expected behavior would be for this corruption if, 
for example, there were merely a half dozen 0x80 in the middle of ASCII text?  
Is that garbage a single "character"?  Perhaps because it's a consecutive 
string of bad bytes?  Or should it be 6 characters since they're nonsense?  Or 
maybe 2 characters because the maximum # of trail bytes we can have is 3?

What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?
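
As one data point rather than an answer, CPython's decoder (which follows the 
maximal-subpart recommendation) counts those two cases like this:

    stray_bytes = b"abc\x80\x80\x80\x80\x80\x80def"   # six lone continuation bytes
    bare_leads  = b"\xc2\xc2"                         # two 2-byte lead bytes, no trail bytes

    print(stray_bytes.decode("utf-8", errors="replace").count("\ufffd"))   # 6
    print(bare_leads.decode("utf-8", errors="replace").count("\ufffd"))    # 2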

I can see how different implementations might be able to come up with "rules" 
that would help them navigate (or clean up) those minefields, however it is not 
at all clear to me that there is a "best practice" for those situations.

There also appears to be a special weight given to non-minimally-encoded 
sequences.  It would seem to me that none of these illegal sequences should 
appear in practice, so we have either:

* A bad encoder spewing out garbage (overlong sequences)
* Flipped bit(s) due to storage/transmission/whatever errors
* Lost byte(s) due to storage/transmission/coding/whatever errors
* Extra byte(s) due to whatever errors
* Bad string manipulation breaking/concatenating in the middle of sequences, 
causing garbage (perhaps one of the above 2 coding errors).

Only in the first case, of a bad encoder, are the overlong sequences actually 
"real".  And that shouldn't happen (it's a bad encoder after all).  The other 
scenarios seem just as likely, (or, IMO, much more likely) than a badly 
designed encoder creating overlong sequences that appear to fit the UTF-8 
pattern but aren't actually UTF-8.

The other cases are going to cause byte patterns that are less "obvious" about 
how they should be navigated for various applications.

I do not understand the energy being invested in a case that shouldn't happen, 
especially in a case that is a subset of all the other bad cases that could 
happen.

-Shawn 



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence <C0 AF> 
> as U+002F.

Sort of, maybe.  It was not legal for them to generate it though.  So you could 
kind of infer that it was not a legal sequence.

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Which is to completely reverse the current recommendation in Unicode 9.0. 
> While I agree that this might help you fending off a bug report, it would 
> create chances for bug reports for Ruby, Python3, many if not all Web 
> browsers,...

& Windows & .Net

Changing the behavior of the Windows / .Net SDK is a non-starter.

> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows 
> what it means, but everybody knows they don't exist.

Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN.  
We're trying to optimize a case for data that implementations should never 
encounter.  It is sort of exactly like optimizing for the case where your data 
input is actually a dragon and not UTF-8 text.  

Since it is illegal, then the "at least 1 FFFD but as many as you want to emit 
(or just fail)" is fine.

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> I think nobody is debating that this is *one way* to do things, and that some 
> code does it.

Except that they sort of are.  The premise is that the "old language was 
wrong", and the "new language is right."  The reason we know the old language 
was wrong was that there was a bug filed against an implementation because it 
did not conform to the old language.  The response to the application bug was 
to change the standard's recommendation.

If this language is adopted, then the opposite is going to happen:  Bugs will 
be filed against applications that conform to the old recommendation and not 
the new recommendation.  They will say "your code could be better, it is not 
following the recommendation."  Eventually that will escalate to some level 
that it will need to be considered, however, regardless of the improvements, it 
will be a "breaking change".

Changing code from one recommendation to another will change behavior.  For 
applications or SDKs with enough visibility, that will break *someone* because 
that's how these things work.  For applications that choose not to change, in 
response to some RFP, someone's going to say "you don't fully conform to 
Unicode, we'll go with a different vendor."  Not saying that these things make 
sense, that's just the way the world works.

In some situations, one form is better, in some cases another form is better.  
If the intent is truly that there is not "one way to do things," then the 
language should reflect that.

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Shawn Steele via Unicode
So basically this came about because code got bugged for not following the 
"recommendation."   To fix that, the recommendation will be changed.  However 
then that is going to lead to bugs for other existing code that does not follow 
the new recommendation.

I totally get the forward/backward scanning in sync without decoding reasoning 
for some implementations, however I do not think that the practices that 
benefit those should extend to other applications that are happy with a 
different practice.

In either case, the bad characters are garbage, so neither approach is "better" 
- except that one or the other may be more conducive to the requirements of the 
particular API/application.

I really think the correct approach here is to allow any number of replacement 
characters without prejudice.  Perhaps with suggestions for pros and cons of 
various approaches if people feel that is really necessary.

-Shawn

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson 
via Unicode
Sent: Friday, May 26, 2017 2:16 PM
To: Ken Whistler 
Cc: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 05/26/2017 12:22 PM, Ken Whistler wrote:
> 
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>> The link provided about the PRI doesn't lead to the comments.
>>
> 
> PRI #121 (August, 2008) pre-dated the practice of keeping all the 
> feedback comments together with the PRI itself in a numbered directory 
> with the name "feedback.html". But the comments were collected 
> together at the time and are accessible here:
> 
> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
> 
> Also there was a separately submitted comment document:
> 
> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
> 
> And the minutes of the pertinent UTC meeting (UTC #116):
> 
> http://www.unicode.org/L2/L2008/08253.htm
> 
> The minutes simply capture the consensus to adopt Option #2 from PRI 
> #121, and the relevant action items.
> 
> I now return the floor to the distinguished disputants to continue 
> litigating history. ;-)
> 
> --Ken
> 
>

The reason this discussion got started was that in December, someone came to me 
and said the code I support does not follow Unicode best practices, and 
suggested I need to change, though no ticket (yet) has been filed.  I was 
surprised, and posted a query to this list about what the advantages of the new 
approach are.  There were a number of replies, but I did not see anything that 
seemed definitive.  After a month, I created a ticket in Unicode and Markus was 
assigned to research it, and came up with the proposal currently being debated.

Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by Markus, even with his stake in ICU, supporting 
option #2.

Looking at the comments, I don't see any discussion of the effect of this on 
overlong treatments.  My guess is that the effect change was unintentional.

So I have code that handled overlongs in the only correct way possible when 
they were acceptable, and in the obvious way after they became illegal, and now 
without apparent discussion (which is very much akin to "flimsy reasons"), it 
suddenly was no longer "best practice".  And that change came "rather late in 
the game".  That this escaped notice for years indicates that the specifics of 
REPLACEMENT CHAR handling don't matter all that much.

To cut to the chase, I think Unicode should issue a Corrigendum to the effect 
that it was never the intent of this change to say that treating overlongs as a 
single unit isn't best practice.  I'm not sure this warrants a full-fledged 
Corrigendum, though.  But I believe the text of the best practices should 
indicate that treating overlongs as a single unit is just as acceptable as 
Martin's interpretation.

I believe this is pretty much in line with Shawn's position.  Certainly, a 
discussion of the reasons one might choose one interpretation over another 
should be included in TUS.  That would likely have satisfied my original query, 
which hence would never have been posted.



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
> If the thread has made one thing clear is that there's no consensus in the 
> wider community
> that one approach is obviously better. When it comes to ill-formed sequences, 
> all bets are off.
> Simple as that.

> Adding a "recommendation" this late in the game is just bad standards policy.

I agree.  I'm not sure what value this provides.  If someone thought it added 
value to discuss the pros and cons of implementing it one way and the other as 
MAY do this or MAY do that, I don't mind.  But I think both should be 
permitted, and neither should be encouraged with anything stronger than a MAY.

-Shawn




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
+ the list, which somehow my reply seems to have lost.

> I may have missed something, but I think nobody actually proposed to change 
> the recommendations into requirements

No thanks, that would be a breaking change for some implementations (like mine) 
and force them to become non-complying or potentially break customer behavior.

I would prefer that both options be permitted, perhaps with a few words of 
advantages.

-Shawn




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Faster ok, provided this does not break other uses, notably for random 
> access within strings…

Either way, this is a “recommendation”.  I don’t see how that can provide for 
not-“breaking other uses.”  If it’s internal, you can do what you will, so if 
you need the 1:1 seeming parity, then you can do that internally.  But if 
you’re depending on other APIs/libraries/data source/whatever, it would seem 
like you couldn’t count on that.  (And probably shouldn’t even if it was a 
requirement rather than a recommendation).

I’m wary of the idea of attempting random access on a stream that is also 
manipulating the stream at the same time (decoding apparently).

The U+FFFD emitted by this decoding could also require a different # of bytes 
to reencode.  Which might disrupt the presumed parity, depending on how the 
data access was being handled.
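
A two-line sketch of that length change (CPython's decoder assumed):

    bad = b"ab\x80cd"                                  # five bytes, one of them stray
    repaired = bad.decode("utf-8", errors="replace").encode("utf-8")
    print(len(bad), len(repaired))                     # 5 7: U+FFFD re-encodes as three bytes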

-Shawn


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
But why change a recommendation just because it “feels like”.  As you said, 
it’s just a recommendation, so if that really annoyed someone, they could do 
something else (eg: they could use a single FFFD).

If the recommendation is truly that meaningless or arbitrary, then we just get 
into silly discussions of “better” that nobody can really answer.

Alternatively, how about “one or more FFFDs?” for the recommendation?

To me it feels very odd to perhaps require writing extra code to detect an 
illegal case.  The “best practice” here should maybe be “one or more FFFDs, 
whatever makes your code faster”.

Best practices may not be requirements, but people will still take time to file 
bugs that something isn’t following a “best practice”.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer 
via Unicode
Sent: Tuesday, May 16, 2017 11:37 AM
To: Alastair Houghton 
Cc: Philippe Verdy ; Henri Sivonen ; 
unicode Unicode Discussion ; Hans Åberg 

Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies 
to finding and interpreting valid sequences properly. This includes not 
consuming parts of valid sequences when dealing with illegal ones, as explained 
in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think 
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU 
team. At the time, I believe the ISO UTF-8 definition was not yet limited to 
U+10FFFF, and decoding overlong sequences and those yielding surrogate code 
points was regarded as a misdemeanor. The spec has been tightened up, but I am 
pretty sure that most people familiar with how UTF-8 came about would recognize 
 and  as single sequences.

I believe that the discussion of how to handle illegal sequences came out of 
security issues a few years ago from some implementations including valid 
single and lead bytes with preceding illegal sequences. Beyond the "Constraints 
on Conversion Processes", there was evidently also a desire to recommend how to 
handle illegal sequences.

I think that the current recommendation was an extrapolation of common practice 
for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, 
but "it feels like" (yes, that's the level of argument for stuff that doesn't 
really matter) not treating  and  as single sequences is 
"weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for 
debugging and visual inspection. Maybe some process is using illegal, overlong 
sequences to encode something special (à la Java string serialization, 
"modified UTF-8"), and for that it might be convenient too to treat overlong 
sequences as single errors.

If you don't like some recommendation, then do something else. It does not 
matter. If you don't reject the whole input but instead choose to replace 
illegal sequences with something, then make sure the something is not nothing 
-- replacing with an empty string can cause security issues. Otherwise, what 
the something is, or how many of them you put in, is not very relevant. One or 
more U+FFFDs is customary.
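
A tiny illustration of the "not nothing" point (the filter scenario is 
hypothetical):

    # Deleting ill-formed bytes can splice two separated fragments back into a
    # token that an earlier security check had already ruled out.
    payload = b"java\x80script:alert(1)"

    print(payload.decode("utf-8", errors="ignore"))    # 'javascript:alert(1)'
    print(payload.decode("utf-8", errors="replace"))   # 'java\ufffdscript:alert(1)'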

When the current recommendation came in, I thought it was reasonable but didn't 
like the edge cases. At the time, I didn't think it was important to twiddle 
with the text in the standard, and I didn't care that ICU didn't exactly 
implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with 
a space, because it's easier than writing an U+FFFD for each byte or for some 
subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long 
illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best 
practices are wrong." I think "wrong" is far too strong, but I got an action 
item to propose a change in the text. I proposed a modified recommendation. 
Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" 
that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a wider 
set of sequences, but a capable implementer will optimize successfully for 
valid sequences, and maybe even for a subset of those for what might be 
expected high-frequency code point ranges. Error handling can go into a slow 
path. In a true state table implementation, it will require more states but 
should not affect the performance of valid sequences.

Many years ago, I decided for ICU to add a small amount of slow-path 
error-handling code for more 

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
Regardless, it's not legal and hasn't been legal for quite some time.  
Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to 
anything depending on that fake-null, so one or three isn't really going to 
matter.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +0000
Shawn Steele via Unicode <unicode@unicode.org> wrote:

> > Would you advocate replacing
> 
> >   e0 80 80
> 
> > with
> 
> >   U+FFFD U+FFFD U+FFFD (1)  
> 
> > rather than
> 
> >   U+FFFD   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d say, 
> > and while we certainly don’t want to decode it as a NUL (that was 
> > the source of previous security bugs, as I recall), I also don’t see 
> > the logic in insisting that it must be decoded to *three* code 
> > points when it clearly only represented one in the input.
> 
> It is not at all clear what the intent of the encoder was - or even if 
> it's not just a problem with the data stream.  E0 80 80 is not 
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is still in 
use, and seems to be the best way of storing NUL as character content in a *C 
string*.  (Strictly speaking, one can't do it.)  It could be lurking in old 
text or come from an old program that somehow doesn't get used for U+0080 to 
U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of 
converting UTF-16 to UTF-8.

Remember the conformance test for the Unicode Collation Algorithm has contained 
lone surrogates in the past, and the UAX on Unicode Regular Expressions used to 
require the ability to search for lone surrogates.

Richard.




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Would you advocate replacing

>   e0 80 80

> with

>   U+FFFD U+FFFD U+FFFD (1)

> rather than

>   U+FFFD   (2)

> It’s pretty clear what the intent of the encoder was there, I’d say, and 
> while we certainly don’t 
> want to decode it as a NUL (that was the source of previous security bugs, as 
> I recall), I also don’t
> see the logic in insisting that it must be decoded to *three* code points 
> when it clearly only 
> represented one in the input.

It is not at all clear what the intent of the encoder was - or even if it's not 
just a problem with the data stream.  E0 80 80 is not permitted, it's garbage.  
An encoder can't "intend" it.

Either
A) the "encoder" was attempting to be malicious, in which case the whole thing 
is suspect and garbage, and so the # of FFFD's doesn't matter, or

B) the "encoder" is completely broken, in which case all bets are off, again, 
specifying the # of FFFD's is irrelevant.

C) The data was corrupted by some other means.  Perhaps bad concatenations, 
lost blocks during read/transmission, etc.  If we lost two 512-byte blocks, then 
maybe we should have a thousand FFFDs (but how would we know?)

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Shawn Steele via Unicode
>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.

IMO, the only thing that can be agreed upon is that "something's bad with this 
UTF-8 data".  I think that whether it's treated as a single group of corrupt 
bytes or each individual byte is considered a problem should be up to the 
implementation.

#1 - This data should "never happen".  In a system behaving normally, this 
condition should never be encountered.  
  * At this point the data is "bad" and all bets are off.
  * Some applications may have a clue how the bad data could have happened and 
want to do something in particular.
  * It seems odd to me to spend much effort standardizing a scenario that 
should be impossible.
#2 - Depending on implementation, either behavior, or some combination, may be 
more efficient.  I'd rather allow apps to optimize for the common case, not the 
case-that-shouldn't-ever-happen
#3 - We have no clue if this "maximal" sequence was a single error, 2 errors, 
or even more.  The lead byte says how many trail bytes should follow, and those 
should be in a certain range.  Values outside of those conditions are illegal, 
so we shouldn't ever encounter them.  So if we did, then something really weird 
happened.  
  * Did a single character get misencoded?
  * Was an illegal sequence illegally encoded?
  * Perhaps a byte got corrupted in transmission?
  * Maybe we dropped a packet/block, so this is really the beginning of a valid 
sequence and the tail of another completely valid sequence?

In practice, all that most apps would be able to do would be to say "You have 
bad data, how bad I have no clue, but it's not right".  A single bit could've 
flipped, or you could have only 3 pages of a 4000 page document.  No clue at 
all.  At that point it doesn't really matter how many FFFD's the error(s) are 
replaced with, and no assumptions should be made about the severity of the 
error.

-Shawn