Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode Tue, 16 May 2017 03:01:56 -0700

> On 16 May 2017, at 10:29, David Starner <prosfil...@gmail.com> wrote:
> 
> On Tue, May 16, 2017 at 1:45 AM Alastair Houghton 
> <alast...@alastairs-place.net> wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen to 
> decode to U+FFFD.  There might seem to be *two* names that both contain 
> U+FFFD in the same place.  How do you distinguish between them?
> 
>> If the database holds raw bytes, then the name is a byte string, not a 
>> Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule 
>> to make and enforce that a string in a database is a validly formatted 
>> string; I would hope that most SQL servers do in fact reject malformed UTF-8 
>> strings. On the other hand, I'd expect that an SQL server would accept 
>> U+FFFD in a Unicode string.


Databases typically separate the encoding in which strings are stored from the 
encoding in which an application connected to the database is operating.  A 
database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other 
character set, while presenting it to a client application as UTF-8 or UTF-16.  
Hence my comment - application software could very well see two names that are 
apparently identical and that include U+FFFDs in the same places, even though 
the database back-end actually has different strings.  As I said, this is a 
problem we already have.
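
As a rough illustration of that collision, here is a minimal Python sketch.  It uses Python's built-in UTF-8 decoder with errors="replace" as a stand-in for whatever conversion a database performs between its storage encoding and the client encoding.  The names stored_a, stored_b, seen_a and seen_b are made up for the example.

```python
# Two different stored byte strings (hypothetical values, not from the thread).
stored_a = b"Ren\xe9e"   # 0xE9 is a lead byte with no continuation bytes after it
stored_b = b"Ren\xe8e"   # a different byte, ill-formed in the same way

# What a client asking for Unicode text would see after lossy decoding.
seen_a = stored_a.decode("utf-8", errors="replace")
seen_b = stored_b.decode("utf-8", errors="replace")

print(stored_a == stored_b)   # False: the stored bytes differ
print(seen_a == seen_b)       # True: both become "Ren\ufffde"
```

The application only ever sees the decoded strings, so it cannot tell the two records apart however many U+FFFDs the decoder chooses to emit.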

> I don’t see a problem; the point is that where a structurally valid UTF-8 
> encoding has been used, albeit in an invalid manner (e.g. encoding a number 
> that is not a valid code point, or encoding a valid code point as an 
> over-long sequence), a single U+FFFD is appropriate.  That seems a perfectly 
> sensible rule to adopt.
>  
>> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that 
>> the only source of such UTF-8 data is willful attempts to break security, 
>> and in that case, how is this a win? Nonattack sources of broken data are 
>> much more likely to be the result of mixing UTF-8 with other character 
>> encodings or raw binary data.

I’d say there are three sources of UTF-8 data of that ilk:

(a) bugs,
(b) “Modified UTF-8” and “CESU-8” implementations,
(c) wilful attacks.

Case (b) in particular is quite common, and the result of the presently 
recommended approach doesn’t make much sense there: [c0 80] will get replaced 
with *two* U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *six* 
U+FFFDs, one per byte (surrogates aren’t supposed to be valid in UTF-8, 
right?).
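
To compare the two policies on those byte sequences, here is a small Python sketch.  decode_structural is a hypothetical helper of my own, not anything from a library or from the proposal text: it emits a single U+FFFD for any sequence whose lead byte is followed by the number of continuation bytes its bit pattern calls for but which does not decode to a valid scalar value, and a single U+FFFD for a truncated sequence or a stray byte.  The built-in decoder stands in for the presently recommended approach, on the assumption that CPython follows it.

```python
def decode_structural(data: bytes) -> str:
    """Sketch: one U+FFFD per structurally complete but invalid sequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        # Sequence length implied by the lead byte's bit pattern alone.
        if 0xC0 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF7:
            length = 4
        else:                             # stray continuation byte or F8..FF
            out.append("\ufffd")
            i += 1
            continue
        # Consume continuation bytes (80..BF) up to the implied length.
        j = i + 1
        while j < n and j < i + length and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j - i == length:
            try:
                out.append(data[i:j].decode("utf-8"))   # well-formed
            except UnicodeDecodeError:
                out.append("\ufffd")      # over-long, surrogate or out of range
        else:
            out.append("\ufffd")          # truncated sequence
        i = j
    return "".join(out)


samples = [bytes([0xC0, 0x80]),
           bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80])]
for sample in samples:
    builtin = sample.decode("utf-8", errors="replace").count("\ufffd")
    structural = decode_structural(sample).count("\ufffd")
    print(sample.hex(" "), "built-in:", builtin, "structural:", structural)
```

With CPython 3 I'd expect the built-in decoder to report one U+FFFD per byte for both samples, and the sketch to report one per sequence (one for [c0 80], two for the surrogate pair).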

Kind regards,

Alastair.

--
http://alastairs-place.net
