Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
2017-07-25 0:35 GMT+02:00 Doug Ewell via Unicode :

> J Decker wrote:
>
> > I generally accepted any utf-8 encoding up to 31 bits though ( since
> > I was going from the original spec, and not what was effective limit
> > based on unicode codepoint space)
>
> Hey, everybody: Don't do that.
>
> UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
> four bytes) for almost fourteen years now.


I fully agree. This is now an essential part of UTF-8 that has helped
secure it (including against the dangerous unbounded loops that scan
through buffers in memory), and has also helped improve performance: when
you unroll loops whose iterations you no longer need to count separately,
the code expansion is small enough that branch prediction still works and
the code still benefits from the instruction cache. Because of the way the
UCS code space is allocated and used, the branches in your code have very
distinctive patterns that are easy to enumerate; test coverage for those
branches is possible without exploding combinatorially, which eliminates
the need for heuristics.

As for the RFC we were discussing, it is rather recent compared to the
approved stabilization of UTF-8 and its eventual endorsement by the
industry. UTF-8 is strictly bound to 4 bytes and nothing more. This allows
other things to be built on top of that fact and to rely on it as a checked
assumption, one that can only be broken by software bugs, which will soon
create security problems once the assumption is no longer verified
throughout a processing chain.
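
As a minimal sketch of what the 4-byte bound looks like in code, the
function below classifies a lead byte under the RFC 3629 limits. The name
utf8_seq_len is hypothetical and not from this thread. It only looks at the
lead byte; a full validator would also have to check the trail bytes and
the restricted second-byte ranges after 0xE0, 0xED, 0xF0 and 0xF4.

```c
#include <stddef.h>

/* Hypothetical helper: total length (1..4) of a UTF-8 sequence as implied
 * by its lead byte, or 0 if the byte cannot start a sequence under
 * RFC 3629. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F) return 1;                  /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* U+0080..U+07FF */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* U+0800..U+FFFF */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* U+10000..U+10FFFF */
    /* 0x80..0xBF are trail bytes, 0xC0/0xC1 can only start overlong
     * forms, and 0xF5..0xFF (including the old 5- and 6-byte lead bytes)
     * are outside the Unicode code space. */
    return 0;
}
```

Because the result is never larger than 4, a caller can size its buffers
and unroll its checks around that constant.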

The old RFC was not "UTF-8" (even if that name was proposed, it was not
really assigned) but an early proposal under discussion that did not reach
the level of a standard or best practice. It was experimental, and at that
time there were several other candidates (including UTF-7, which is now
almost abandoned, and BOCU-1, which is now marginal but was also bound to
the 17-plane limit). The encoding in the old RFC should just be given
another name, since it was not used only for encoding text: it in fact
described a binary format. (For a generic variable-length binary encoding
of numbers there are now better candidates, which are not limited to 31
bits or even to unsigned integers, are faster to process and more compact,
and have more interesting properties for code analysis and for resistance
to encoding and transmission/storage errors.)

In the IANA database for charsets, the old RFC encoding has a separate
identifier, but "UTF-8" refers to RFC 3629 (IETF standard 63); the former
proposals in RFC 2279 or RFC 2044 have never been approved standards, but
just drafts mapped in IANA as the obsolete "UNICODE-1-1-UTF-8" (retired
later as it was never approved by Unicode).

The only remaining "charset" in the IANA database that refers to 31-bit
code points is "ISO-10646-UCS-4", but it does not use a variable-length
encoding and does not specify any byte order. It is just a basic subtype
for a range of positive integers, without any restriction of use, and not
necessarily representing text. It is a very inefficient way to encode them,
only meant as an internal temporary transform in transient memory or CPU
registers (at least for 32-bit CPUs or wider, which is almost always the
case today even in embedded systems, as 4-, 8- or 16-bit CPUs are almost
dead or will not be used for international text processing; even the
simplest keyboard controllers, which manage ~100-150 keys and a few LEDs
and report at 1 kHz for the fastest ones, now use 32-bit CPUs internally).


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Doug Ewell via Unicode
J Decker wrote:

> I generally accepted any utf-8 encoding up to 31 bits though ( since
> I was going from the original spec, and not what was effective limit
> based on unicode codepoint space)

Hey, everybody: Don't do that.

UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
four bytes) for almost fourteen years now.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread J Decker via Unicode
On Mon, Jul 24, 2017 at 1:50 PM, Philippe Verdy  wrote:

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode :
>
>>
>>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very insecure unbounded loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
> if ((from[0]&0xC0) == 0x80) from--;
> else if ((from[-1]&0xC0) == 0x80) from -=2;
> else if ((from[-2]&0xC0) == 0x80) from -=3;
> if ((from[0]&0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)
>
>
I generally accepted any UTF-8 encoding up to 31 bits, though (since I was
going from the original spec, and not from the effective limit based on
the Unicode code point space), and the while loop is more terse but less
optimal because of code pipeline flushing from the backward jump; so yes,
the if series is much better :)  (the original code also has the start of
the string, and strings are effectively prefixed with a 0 byte anyway
because of a long little-endian size)

and you'd probably be tracking an output offset also, so it becomes a
little longer than the above.

And it should be secured using a guard byte at start of your buffer in
> which the "from" pointer was pointing, so that it will never read something
> else and can generate an error.
>
>


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
2017-07-24 22:50 GMT+02:00 Philippe Verdy :

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode :
>
>>
>>
>> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> Hi Folks,
>>>
>>> 2. (Bug) The sending application performs the folding process - inserts
>>> CRLF plus white space characters - and the receiving application does the
>>> unfolding process but doesn't properly delete all of them.
>>>
>>> The RFC doesn't say 'characters' but either a space or a tab character
>> (singular)
>>
>>  back scanning is simple enough
>>
>> while( ( from[0] & 0xC0 ) == 0x80 )
>> from--;
>>
>
> Certainly not like this! Backscanning should only directly use a single
> assignment to the last known start position, no loop at all! UTF-8
> security is based on the fact that its sequences are strictly limited in
> length so that you will never have more than 3 trailing bytes.
>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very insecure unbounded loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
> if ((from[0]&0xC0) == 0x80) from--;
> else if ((from[-1]&0xC0) == 0x80) from -=2;
> else if ((from[-2]&0xC0) == 0x80) from -=3;
> if ((from[0]&0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)
>
Sorry, sent too fast, I should not have copy-pasted lines trying to adapt
your loop; the correct code uses no "else" at all:

> if ((from[0]&0xC0) == 0x80) from--;
> if ((from[0]&0xC0) == 0x80) from--;
> if ((from[0]&0xC0) == 0x80) from--;
> if ((from[0]&0xC0) == 0x80) throw (some exception);
> // continue here with character encoded as UTF-8 starting at "from"
> // (an ASCII byte or a UTF-8 leading byte)
>
>


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
2017-07-24 21:12 GMT+02:00 J Decker via Unicode :

>
>
> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
> unicode@unicode.org> wrote:
>
>> Hi Folks,
>>
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
>>
>> The RFC doesn't say 'characters' but either a space or a tab character
> (singular)
>
>  back scanning is simple enough
>
> while( ( from[0] & 0xC0 ) == 0x80 )
> from--;
>

Certainly not like this! Backscanning should only directly use a single
assignment to the last known start position, no loop at all! UTF-8
security is based on the fact that its sequences are strictly limited in
length so that you will never have more than 3 trailing bytes.

If you don't have that last position in a variable, just use 3 tests but NO
loop at all: if all 3 tests are failing, you know the input was not valid
at all, and the way to handle this error will not be solved simply by using
a very insecure unbounded loop like above but by exiting and returning an
error immediately, or throwing an exception.

The code should better be:

if ((from[0]&0xC0) == 0x80) from--;
else if ((from[-1]&0xC0) == 0x80) from -=2;
else if ((from[-2]&0xC0) == 0x80) from -=3;
if ((from[0]&0xC0) == 0x80) throw (some exception);
// continue here with character encoded as UTF-8 starting at "from"
// (an ASCII byte or a UTF-8 leading byte)

And it should be secured using a guard byte at start of your buffer in
which the "from" pointer was pointing, so that it will never read something
else and can generate an error.
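
One way to combine the bounded back scan with a check against the start of
the buffer is sketched below. The function name utf8_seq_start and its
parameters are hypothetical, and it uses an explicit start pointer instead
of a guard byte. The loop is capped at three steps, so it behaves like
three unrolled tests rather than an open-ended scan. It assumes start <= pos
and that pos points inside the buffer.

```c
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the first byte of the UTF-8
 * sequence that contains *pos, or NULL if no ASCII or lead byte is found
 * within 3 bytes back or before reaching the start of the buffer. */
const unsigned char *utf8_seq_start(const unsigned char *start,
                                    const unsigned char *pos)
{
    int steps;

    for (steps = 0; steps < 3; steps++) {
        if ((*pos & 0xC0) != 0x80)
            return pos;          /* ASCII byte or lead byte */
        if (pos == start)
            return NULL;         /* ran out of buffer: invalid input */
        pos--;
    }
    /* After 3 steps back, this byte must be a lead byte. */
    return ((*pos & 0xC0) != 0x80) ? pos : NULL;
}
```

Returning NULL gives the caller the "exit and return an error immediately"
path without ever reading before the buffer.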


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread J Decker via Unicode
On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
unicode@unicode.org> wrote:

> Hi Folks,
>
> 2. (Bug) The sending application performs the folding process - inserts
> CRLF plus white space characters - and the receiving application does the
> unfolding process but doesn't properly delete all of them.
>
> The RFC doesn't say 'characters' but either a space or a tab character
(singular)

 back scanning is simple enough

while( ( from[0] & 0xC0 ) == 0x80 )
from--;

should probably also check that from > (start+1) but since it should be
applied at 75-ish characters, that would be implicitly true.


RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hi Folks,

Thank you very much for your fantastic comments!

Below I summarized the issue and your comments. At the bottom is a set of 
proposed requirements (for my clients) on applications that receive iCalendar 
files.

Some questions:
 
- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements? 

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be 
longer than 75 octets:

Lines of text SHOULD NOT be longer
than 75 octets, excluding the line break.
 
The RFC says that long lines should be folded:

Long content lines SHOULD be split
into a multiple line representations
using a line "folding" technique.
That is, a long line can be split between
any two characters by inserting a CRLF
immediately followed by a single linear
white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be 
unfolded:

When parsing a content line, folded lines MUST
first be unfolded. 

using this technique:

Unfolding is accomplished by  removing the
CRLF and the linear white-space character
that immediately follows. 

The RFC acknowledges that some implementations might do folding in the middle 
of a multi-octet sequence:

Note: It is possible for very simple
implementations to generate improperly
folded lines in the middle of a UTF-8
multi-octet sequence.  For this reason,
implementations need to unfold lines
in such a way to properly restore the
original sequence. 

Here is an example of folding in the middle of a UTF-8 multi-octet sequence: 

The iCalendar file contains the Yen sign (U+00A5), which is represented by the 
byte sequence 0xC2 0xA5 in UTF-8. The content line containing the Yen sign is 
folded in the middle of the two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, 
which isn't valid UTF-8 any longer.
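
A minimal sketch of unfolding at the octet level, before any UTF-8
decoding takes place, is given below. The function name ical_unfold is
hypothetical. Because it removes exactly CRLF plus one SPACE or HTAB and
never interprets the surrounding bytes as characters, the split Yen sign
from the example comes back as 0xC2 0xA5.

```c
#include <stddef.h>

/* Hypothetical helper: removes every CRLF that is immediately followed by
 * a single SPACE or HTAB, in place, and returns the new length. The buffer
 * is treated as raw octets; no UTF-8 decoding is attempted here. */
size_t ical_unfold(unsigned char *buf, size_t len)
{
    size_t r = 0;   /* read position */
    size_t w = 0;   /* write position */

    while (r < len) {
        if (r + 2 < len &&
            buf[r] == 0x0D && buf[r + 1] == 0x0A &&
            (buf[r + 2] == 0x20 || buf[r + 2] == 0x09)) {
            r += 3;                  /* drop CRLF and one whitespace octet */
        } else {
            buf[w++] = buf[r++];     /* keep everything else unchanged */
        }
    }
    return w;
}
```

Decoding the result as UTF-8 only after this step avoids turning the
partial sequences into replacement characters.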

Proposed requirements on the behavior of applications that receive iCalendar 
files:

1. (Bug) The receiving application does not recognize that it has received an 
iCalendar file.

2. (Bug) The sending application performs the folding process - inserts CRLF 
plus white space characters - and the receiving application does the unfolding 
process but doesn't properly delete all of them.

3. (Non-conformant behavior) The receiving application, after folding and 
before unfolding, attempts to interpret the partial UTF-8 sequences and convert 
them into replacement characters or worse.



Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Doug Ewell via Unicode
Costello, Roger L. wrote:

> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence. 
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence? 

1. (Bug) The folding process inserts CRLF plus white space characters,
and the unfolding process doesn't properly delete all of them.

2. (Non-conformant behavior) Some process, after folding and before
unfolding, attempts to interpret the partial UTF-8 sequences and
converts them into replacement characters or worse.

In a minimally decent implementation, splitting and reassembling a UTF-8
sequence should always yield the correct result; there should be no
ambiguity.

A good implementation, of course, would know the character encoding of
the data, and would not split multi-byte sequences in that encoding to
begin with.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
Also note that the maximum line-length in that RFC is a SHOULD and not a
MUST. This is intended to give a reasonable hint for the limit used in
implementations that process data in the given format: The RFC suggests a
maximum line length of 75 "characters", excluding the CRLF+SPACE
continuation sequence (not clear here what it means given that it refers to
UTF-8: should it be "code units", i.e. bytes?)

Due to this ambiguity, all implementations will need to interpret it as if
it were actually 75 Unicode characters, each of which could take up to 4
bytes in UTF-8, i.e. 300 bytes. Most implementations will use input buffers
for lines up to 512 bytes (including the CRLF+SPACE continuation), so it
will be simpler to handle the continuation just AFTER the line length
limit has been reached, without ever rolling back. And in all cases, there
should never be any CRLF+SPACE continuation sequence in the middle of a
UTF-8 sequence, since that breaks the initial UTF-8 condition assumed by
this RFC, i.e. it breaks conformance to that RFC.

If an implementation thinks that 75 is a number of bytes, it is wrong, but
given the UTF-8 reference it could still use that limit, provided it does
not break in the middle of a UTF-8 sequence. It will still be safe for it
to break just after the sequence, even if the line (excluding the
CRLF+SPACE continuation sequence) is then up to 78 bytes long. Decoders
will still be able to parse it without breaking if they have the most
common 512-byte input buffer.


2017-07-24 17:27 GMT+02:00 Philippe Verdy :

> But at the same time that RFC makes a direct reference to UTF-8 as being
> the default charset, so an implementation of the RFC cannot be agnostic to
> what UTF-8 is and will not break in the middle of a conforming UTF-8
> sequence.
>
> When the limit is reached, that implementation knows that it cannot cut at
> the position of a UTF-8 trailing byte, and knows that it can safely roll
> back at most 3 bytes to locate a conforming UTF-8 leading byte and split
> the line **before** it (or any 7-bit ASCII byte, to split the line just
> **after** it). This requires very little buffering, and this is a
> fundamental property of UTF-8.
>
> Other character sets -- including /UTF-(16|32)([LB]E)?/ !!! -- are not
> directly supported, except by external decoders which would convert their
> input stream to UTF-8 (with all the same issues that may occur for such a
> conversion when it is not roundtrip compatible or the input does not
> conform to the specification of the input charset, but this is not the
> problem of this RFC: these decoders may also roll back internally, attempt
> to guess another charset, or use substitution, but they are supposed to
> generate conforming UTF-8 on output).
>
>
> 2017-07-24 17:01 GMT+02:00 Steffen Nurpmeso via Unicode <
> unicode@unicode.org>:
>
>> "Costello, Roger L. via Unicode"  wrote:
>>  |Suppose an application splits a UTF-8 multi-octet sequence. The
>> application \
>>  |then sends the split sequence to a client. The client must restore \
>>  |the original sequence.
>>  |
>>  |Question: is it possible to split a UTF-8 multi-octet sequence in such \
>>  |a way that the client cannot unambiguously restore the original
>> sequence?
>>  |
>>  |Here is the source of my question:
>>  |
>>  |The iCalendar specification [RFC 5545] says that long lines must be
>> folded:
>>  |
>>  | Long content lines SHOULD be split
>>  |  into a multiple line representations
>>  |  using a line "folding" technique.
>>  |  That is, a long line can be split between
>>  |  any two characters by inserting a CRLF
>>  |  immediately followed by a single linear
>>  |  white-space character (i.e., SPACE or HTAB).
>>  |
>>  |The RFC says that, when parsing a content line, folded lines must first
>> \
>>  |be unfolded using this technique:
>>  |
>>  | Unfolding is accomplished by removing
>>  |  the CRLF and the linear white-space
>>  |  character that immediately follows.
>>  |
>>  |The RFC acknowledges that simple implementations might generate
>> improperly \
>>  |folded lines:
>>  |
>>  | Note: It is possible for very simple
>>  | implementations to generate improperly
>>  |  folded lines in the middle of a UTF-8
>>  |  multi-octet sequence.  For this reason,
>>  |  implementations need to unfold lines
>>  |  in such a way to properly restore the
>>  |  original sequence.
>>
>> That is not what the RFC says.  It says that simple
>> implementations simply split lines when the limit is reached,
>> which might be in the middle of an UTF-8 sequence.  The RFC is
>> thus improved compared to other RFCs in the email standard
>> section, which do not give any hints on how to do that.  Even
>> RFC 2231, which avoids many of the ambiguities and problems of RFC
>> 2047 (for a different purpose, but still), does not say it so
>> exactly for the reversing character set conversion (which i for
>> one perform _once_ after joining together 

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
But at the same time that RFC makes a direct reference to UTF-8 as being
the default charset, so an implementation of the RFC cannot be agnostic to
what UTF-8 is and will not break in the middle of a conforming UTF-8
sequence.

When the limit is reached, that implementation knows that it cannot cut at
the position of a UTF-8 trailing byte, and knows that it can safely roll
back at most 3 bytes to locate a conforming UTF-8 leading byte and split
the line **before** it (or any 7-bit ASCII byte, to split the line just
**after** it). This requires very little buffering, and this is a
fundamental property of UTF-8.

Other character sets -- including /UTF-(16|32)([LB]E)?/ !!! -- are not
directly supported, except by external decoders which would convert their
input stream to UTF-8 (with all the same issues that may occur for such a
conversion when it is not roundtrip compatible or the input does not
conform to the specification of the input charset, but this is not the
problem of this RFC: these decoders may also roll back internally, attempt
to guess another charset, or use substitution, but they are supposed to
generate conforming UTF-8 on output).
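
The rollback of at most 3 bytes can be sketched as follows. The function
name fold_point and its parameters are hypothetical; limit is treated as a
count of octets (for example 75) and is assumed to be at least 4. The
function only picks the offset where CRLF plus a space would be inserted;
it does not write the fold itself.

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset (at most `limit`) at which a
 * fold can be inserted without splitting a UTF-8 sequence. Returns `len`
 * when the line is short enough to need no fold. */
size_t fold_point(const unsigned char *line, size_t len, size_t limit)
{
    size_t cut = limit;
    int steps;

    if (len <= limit)
        return len;                 /* short line: no fold needed */

    /* line[cut] would become the first octet after the fold, so it must
     * not be a trail byte (10xxxxxx). Step back at most 3 octets. */
    for (steps = 0;
         steps < 3 && cut > 0 && (line[cut] & 0xC0) == 0x80;
         steps++)
        cut--;

    if ((line[cut] & 0xC0) == 0x80)
        return limit;               /* not valid UTF-8: use the raw limit */
    return cut;
}
```

With a Yen sign straddling offsets 74 and 75, the fold moves back to
offset 74, so both octets of the sequence land on the continuation line.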


2017-07-24 17:01 GMT+02:00 Steffen Nurpmeso via Unicode :

> "Costello, Roger L. via Unicode"  wrote:
>  |Suppose an application splits a UTF-8 multi-octet sequence. The
> application \
>  |then sends the split sequence to a client. The client must restore \
>  |the original sequence.
>  |
>  |Question: is it possible to split a UTF-8 multi-octet sequence in such \
>  |a way that the client cannot unambiguously restore the original sequence?
>  |
>  |Here is the source of my question:
>  |
>  |The iCalendar specification [RFC 5545] says that long lines must be
> folded:
>  |
>  | Long content lines SHOULD be split
>  |  into a multiple line representations
>  |  using a line "folding" technique.
>  |  That is, a long line can be split between
>  |  any two characters by inserting a CRLF
>  |  immediately followed by a single linear
>  |  white-space character (i.e., SPACE or HTAB).
>  |
>  |The RFC says that, when parsing a content line, folded lines must first \
>  |be unfolded using this technique:
>  |
>  | Unfolding is accomplished by removing
>  |  the CRLF and the linear white-space
>  |  character that immediately follows.
>  |
>  |The RFC acknowledges that simple implementations might generate
> improperly \
>  |folded lines:
>  |
>  | Note: It is possible for very simple
>  | implementations to generate improperly
>  |  folded lines in the middle of a UTF-8
>  |  multi-octet sequence.  For this reason,
>  |  implementations need to unfold lines
>  |  in such a way to properly restore the
>  |  original sequence.
>
> That is not what the RFC says.  It says that simple
> implementations simply split lines when the limit is reached,
> which might be in the middle of an UTF-8 sequence.  The RFC is
> thus improved compared to other RFCs in the email standard
> section, which do not give any hints on how to do that.  Even
> RFC 2231, which avoids many of the ambiguities and problems of RFC
> 2047 (for a different purpose, but still), does not say it so
> exactly for the reversing character set conversion (which i for
> one perform _once_ after joining together the chunks, but is not
> a written word and, thus, ...).
>
> --steffen
> |
> |Der Kragenbaer,The moon bear,
> |der holt sich munter   he cheerfully and one by one
> |einen nach dem anderen runter  wa.ks himself off
> |(By Robert Gernhardt)
>


Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Steffen Nurpmeso via Unicode
"Costello, Roger L. via Unicode"  wrote:
 |Suppose an application splits a UTF-8 multi-octet sequence. The application \
 |then sends the split sequence to a client. The client must restore \
 |the original sequence. 
 |
 |Question: is it possible to split a UTF-8 multi-octet sequence in such \
 |a way that the client cannot unambiguously restore the original sequence?
 |
 |Here is the source of my question:
 |
 |The iCalendar specification [RFC 5545] says that long lines must be folded:
 |
 | Long content lines SHOULD be split
 |  into a multiple line representations
 |  using a line "folding" technique.
 |  That is, a long line can be split between
 |  any two characters by inserting a CRLF
 |  immediately followed by a single linear
 |  white-space character (i.e., SPACE or HTAB).
 |
 |The RFC says that, when parsing a content line, folded lines must first \
 |be unfolded using this technique:
 |
 | Unfolding is accomplished by removing
 |  the CRLF and the linear white-space
 |  character that immediately follows.
 |
 |The RFC acknowledges that simple implementations might generate improperly \
 |folded lines:
 |
 | Note: It is possible for very simple
 | implementations to generate improperly
 |  folded lines in the middle of a UTF-8
 |  multi-octet sequence.  For this reason,
 |  implementations need to unfold lines
 |  in such a way to properly restore the
 |  original sequence.

That is not what the RFC says.  It says that simple
implementations simply split lines when the limit is reached,
which might be in the middle of an UTF-8 sequence.  The RFC is
thus improved compared to other RFCs in the email standard
section, which do not give any hints on how to do that.  Even
RFC 2231, which avoids many of the ambiguities and problems of RFC
2047 (for a different purpose, but still), does not say it so
exactly for the reversing character set conversion (which i for
one perform _once_ after joining together the chunks, but is not
a written word and, thus, ...).

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)