Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
2017-07-25 0:35 GMT+02:00 Doug Ewell via Unicode:

> J Decker wrote:
>
> > I generally accepted any utf-8 encoding up to 31 bits though (since
> > I was going from the original spec, and not what was effective limit
> > based on unicode codepoint space)
>
> Hey, everybody: Don't do that.
>
> UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
> four bytes) for almost fourteen years now.

I fully agree. This is now an essential part of UTF-8. It has helped secure it (notably against dangerous unbounded loops scanning through buffers in memory) and has also helped performance: when you unroll loops that no longer need a separate counter, the code expansion is small enough that branch prediction still works and the code still benefits from caching. Because of the way the UCS code space is allocated and used, the branches in your code have very distinctive patterns that are easy to enumerate; test coverage for those branches is possible without exploding combinatorially, which eliminates the need for heuristics.

As for the RFC we were discussing, it is rather recent compared to the approved stabilization of UTF-8 and its eventual endorsement by the industry.

UTF-8 is strictly bound to 4 bytes and nothing more. Other things can be built on top of this fact and rely on it as a checked assumption, one that can only be broken by software bugs; such bugs will soon create security problems once the assumption is no longer checked throughout a processing chain.

The old RFC was not "UTF-8" (even if that name was proposed, it was not really assigned) but an early proposal under discussion that did not reach the level of a standard or best practice. It was experimental, and at that time there were several other candidates (including UTF-7, which is now almost abandoned, and BOCU-1, which is now marginal but was also bound to the 17-plane limit).

The encoding in the old RFC should just be given another name. It is not used for encoding only text; it in fact described a binary format. (For a generic variable-length binary encoding of numbers there are now better candidates, which are not limited to 31 bits or even to unsigned integers, and which are also faster to process, more compact, and have more interesting properties for code analysis and for resistance to encoding and transmission/storage errors.)

In the IANA database for charsets, the old RFC encoding has a separate identifier, but "UTF-8" refers to RFC 3629 (IETF STD 63); the former proposals in RFC 2279 and RFC 2044 were never approved standards, just drafts mapped in IANA as the obsolete "UNICODE-1-1-UTF-8" (retired later as it was never approved by Unicode).

The only remaining "charset" in the IANA database that refers to 31-bit code points is "ISO-10646-UCS-4", but it does not use a variable-length encoding and does not specify any byte order. It is just a basic subtype for a range of positive integers, without any restriction of use and not necessarily representing text. It is a very inefficient way to encode them, meant only as an internal temporary transform in transient memory or CPU registers (at least on 32-bit CPUs or higher, which is almost always the case today even in embedded systems, as 4-, 8- or 16-bit CPUs are almost dead or will not be used for international text processing; even the simplest keyboard controllers, which manage ~100-150 keys and a few LEDs and report at 1 kHz for the fastest ones, now use 32-bit CPUs internally).
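To make the four-byte bound concrete, here is a minimal sketch in C of classifying the first byte of a sequence under RFC 3629. The function name `utf8_seq_len` is a hypothetical helper, not something from this thread. It only looks at the lead byte; a full validator would also have to check the trailing bytes and the restricted second-byte ranges after E0, ED, F0 and F4.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): expected length of a UTF-8
 * sequence judged from its first byte, under RFC 3629 (at most 4 bytes,
 * at most U+10FFFF). Returns 0 for a byte that can never start a sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F) return 1;                 /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* C0 and C1 would be overlong */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* F5..F7 would exceed U+10FFFF */
    /* Trailing byte (80..BF), overlong lead (C0, C1), or F5..FF, which
     * includes the lead bytes of the old 5- and 6-byte forms. */
    return 0;
}
```

With this shape, a lead byte such as 0xF8 or 0xFC from the 31-bit scheme is rejected up front instead of being decoded.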
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
J Decker wrote:

> I generally accepted any utf-8 encoding up to 31 bits though (since
> I was going from the original spec, and not what was effective limit
> based on unicode codepoint space)

Hey, everybody: Don't do that.

UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF, four bytes) for almost fourteen years now.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
On Mon, Jul 24, 2017 at 1:50 PM, Philippe Verdy wrote:

> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very insecure unbounded loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
>     if ((from[0] & 0xC0) == 0x80) from--;
>     else if ((from[-1] & 0xC0) == 0x80) from -= 2;
>     else if ((from[-2] & 0xC0) == 0x80) from -= 3;
>     if ((from[0] & 0xC0) == 0x80) throw (some exception);
>     // continue here with character encoded as UTF-8 starting at "from"
>     // (an ASCII byte or a UTF-8 leading byte)

I generally accepted any UTF-8 encoding up to 31 bits, though (since I was going from the original spec, and not from the effective limit based on the Unicode code point space), and the while loop is more terse; but it is less optimal because of code pipeline flushing from the backward jump, so yes, an if series is much better :)

(The original code also has the start of the string, and strings are effectively prefixed with a 0 byte anyway because of a long little-endian size.) You'd probably be tracking an output offset also, so it becomes a little longer than the above.

> And it should be secured using a guard byte at start of your buffer in
> which the "from" pointer was pointing, so that it will never read something
> else and can generate an error.
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
2017-07-24 22:50 GMT+02:00 Philippe Verdy:

> 2017-07-24 21:12 GMT+02:00 J Decker via Unicode:
>
>> The RFC doesn't say 'characters' but either a space or a tab character
>> (singular)
>>
>> back scanning is simple enough
>>
>>     while( ( from[0] & 0xC0 ) == 0x80 )
>>         from--;
>
> Certainly not like this! Backscanning should only directly use a single
> assignment to the last known start position, no loop at all! UTF-8
> security is based on the fact that its sequences are strictly limited in
> length so that you will never have more than 3 trailing bytes.
>
> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very insecure unbounded loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
>     if ((from[0] & 0xC0) == 0x80) from--;
>     else if ((from[-1] & 0xC0) == 0x80) from -= 2;
>     else if ((from[-2] & 0xC0) == 0x80) from -= 3;
>     if ((from[0] & 0xC0) == 0x80) throw (some exception);
>     // continue here with character encoded as UTF-8 starting at "from"
>     // (an ASCII byte or a UTF-8 leading byte)

Sorry, sent too fast. I should not have copy-pasted lines while trying to adapt your loop; the correct code uses no "else" at all:

    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)
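The corrected back scan can also be written against an index and an explicit buffer start, so that it never reads before the buffer even without a guard byte. The following is a minimal sketch under that assumption; `utf8_seq_start` is a hypothetical name, and the counted loop is fixed at three steps, so it is equivalent to the three unrolled tests. It only finds a candidate lead byte; it does not verify that the lead byte announces enough trailing bytes to cover `pos`.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): given buf[0..len) and an index
 * pos < len, return the index of the nearest byte at or before pos that is
 * not a UTF-8 trailing byte, stepping back at most 3 bytes. Returns
 * (size_t)-1 if pos is out of range, if the buffer start is reached, or if
 * 3 steps back still land on a trailing byte (invalid input). */
static size_t utf8_seq_start(const unsigned char *buf, size_t len, size_t pos)
{
    size_t steps;

    if (pos >= len)
        return (size_t)-1;
    for (steps = 0; steps < 3; steps++) {      /* never more than 3 steps */
        if ((buf[pos] & 0xC0) != 0x80)
            return pos;                        /* ASCII or lead byte */
        if (pos == 0)
            return (size_t)-1;                 /* would leave the buffer */
        pos--;
    }
    /* Fourth and last test, matching the final "throw" test above. */
    return ((buf[pos] & 0xC0) != 0x80) ? pos : (size_t)-1;
}
```

For the bytes 61 E2 82 AC 62, a call with `pos` 3 (the last byte of the Euro sign) yields index 1, while a run of five trailing bytes yields the error value.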
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
2017-07-24 21:12 GMT+02:00 J Decker via Unicode:

> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode wrote:
>
>> Hi Folks,
>>
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
>
> The RFC doesn't say 'characters' but either a space or a tab character
> (singular)
>
> back scanning is simple enough
>
>     while( ( from[0] & 0xC0 ) == 0x80 )
>         from--;

Certainly not like this! Backscanning should only directly use a single assignment to the last known start position, no loop at all! UTF-8 security is based on the fact that its sequences are strictly limited in length, so that you will never have more than 3 trailing bytes.

If you don't have that last position in a variable, just use 3 tests but NO loop at all: if all 3 tests fail, you know the input was not valid at all. The way to handle this error is not to use a very insecure unbounded loop like the one above, but to exit and return an error immediately, or to throw an exception.

The code should better be:

    if ((from[0] & 0xC0) == 0x80) from--;
    else if ((from[-1] & 0xC0) == 0x80) from -= 2;
    else if ((from[-2] & 0xC0) == 0x80) from -= 3;
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)

And it should be secured using a guard byte at the start of the buffer into which the "from" pointer points, so that it will never read something else and can generate an error.
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode wrote:

> Hi Folks,
>
> 2. (Bug) The sending application performs the folding process - inserts
> CRLF plus white space characters - and the receiving application does the
> unfolding process but doesn't properly delete all of them.

The RFC doesn't say 'characters' but either a space or a tab character (singular).

Back scanning is simple enough:

    while( ( from[0] & 0xC0 ) == 0x80 )
        from--;

It should probably also check that from > (start+1), but since it should be applied at 75-ish characters, that would be implicitly true.
RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Hi Folks,

Thank you very much for your fantastic comments! Below I summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files.

Some questions:

- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements?

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be longer than 75 octets:

    Lines of text SHOULD NOT be longer than 75 octets,
    excluding the line break.

The RFC says that long lines should be folded:

    Long content lines SHOULD be split into a multiple line
    representations using a line "folding" technique. That is,
    a long line can be split between any two characters by
    inserting a CRLF immediately followed by a single linear
    white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be unfolded:

    When parsing a content line, folded lines MUST first be unfolded.

using this technique:

    Unfolding is accomplished by removing the CRLF and the linear
    white-space character that immediately follows.

The RFC acknowledges that some implementations might do folding in the middle of a multi-octet sequence:

    Note: It is possible for very simple implementations to generate
    improperly folded lines in the middle of a UTF-8 multi-octet
    sequence. For this reason, implementations need to unfold lines
    in such a way to properly restore the original sequence.

Here is an example of folding in the middle of a UTF-8 multi-octet sequence:

The iCalendar file contains the Yen sign (U+00A5), which is represented by the byte sequence 0xC2 0xA5 in UTF-8. The content line containing the Yen sign is folded in the middle of the two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, which isn't valid UTF-8 any longer.

Proposed requirements on the behavior of applications that receive iCalendar files:

1. (Bug) The receiving application does not recognize that it has received an iCalendar file.

2. (Bug) The sending application performs the folding process - inserts CRLF plus white space characters - and the receiving application does the unfolding process but doesn't properly delete all of them.

3. (Non-conformant behavior) The receiving application, after folding and before unfolding, attempts to interpret the partial UTF-8 sequences and convert them into replacement characters or worse.
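The Yen sign example can be replayed at octet level. The sketch below, in C, unfolds a buffer before any UTF-8 decoding takes place, which is the order the RFC note asks for; `ical_unfold` is a hypothetical helper name, and it assumes the whole content is already in memory.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): unfold iCalendar content in
 * place, working on octets only. Every CRLF that is immediately followed
 * by SPACE or HTAB is removed together with that single white-space octet.
 * Returns the new length. UTF-8 decoding is meant to happen afterwards. */
static size_t ical_unfold(unsigned char *buf, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        if (in + 2 < len &&
            buf[in] == 0x0D && buf[in + 1] == 0x0A &&
            (buf[in + 2] == 0x20 || buf[in + 2] == 0x09)) {
            in += 3;                 /* drop CR, LF and exactly one SPACE/HTAB */
        } else {
            buf[out++] = buf[in++];  /* keep everything else untouched */
        }
    }
    return out;
}
```

Applied to 0xC2 0x0D 0x0A 0x20 0xA5, this leaves 0xC2 0xA5, the original Yen sign. A CRLF that is not followed by SPACE or HTAB is a real line break and is kept.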
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Costello, Roger L. wrote:

> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence.
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence?

1. (Bug) The folding process inserts CRLF plus white space characters, and the unfolding process doesn't properly delete all of them.

2. (Non-conformant behavior) Some process, after folding and before unfolding, attempts to interpret the partial UTF-8 sequences and converts them into replacement characters or worse.

In a minimally decent implementation, splitting and reassembling a UTF-8 sequence should always yield the correct result; there should be no ambiguity.

A good implementation, of course, would know the character encoding of the data, and would not split multi-byte sequences in that encoding to begin with.

--
Doug Ewell | Thornton, CO, US | ewellic.org
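As an illustration of not splitting multi-byte sequences to begin with, here is a minimal sketch in C of picking a fold point. The names `fold_point` and `FOLD_LIMIT` are hypothetical, and the sketch assumes the input is already valid UTF-8, so the roll-back is bounded at three octets. A caller would presumably pass a smaller limit for continuation lines, since the leading SPACE or HTAB occupies one octet of the line.

```c
#include <stddef.h>

#define FOLD_LIMIT 75  /* octets per line, excluding the line break (RFC 5545) */

/* Hypothetical helper (not from the thread): decide how many octets of
 * src[0..len) go on the next output line, so that the line holds at most
 * `limit` octets and does not end inside a UTF-8 sequence.
 * Assumes src is valid UTF-8 and limit >= 4. */
static size_t fold_point(const unsigned char *src, size_t len, size_t limit)
{
    size_t cut = limit, back;

    if (len <= limit)
        return len;                  /* fits, nothing to fold */
    /* src[cut] would start the next line, so it must not be a trailing
     * byte; roll back at most 3 octets to reach the lead byte. */
    for (back = 0; back < 3 && (src[cut] & 0xC0) == 0x80; back++)
        cut--;
    return cut;
}
```

The caller writes `cut` octets, then CRLF and a single SPACE, and repeats on the remainder. Rolling back keeps every line within the limit; the alternative of breaking just after the sequence would let a line run a few octets over it.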
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Also note that the maximum line length in that RFC is a SHOULD and not a MUST. It is intended as a reasonable hint for the limit used by implementations that process data in the given format.

The RFC suggests a maximum line length of 75 "characters", excluding the CRLF+SPACE continuation sequence (it is not clear what that means given that it refers to UTF-8: should it be "code units", i.e. bytes?). Due to this ambiguity, all implementations will need to interpret it as if it meant 75 Unicode characters, each of which could take up to 4 bytes in UTF-8, i.e. 300 bytes in total. Most implementations will use input buffers for lines of up to 512 bytes (including the CRLF+SPACE continuation), so it will be simpler to insert the continuation just AFTER the line length limit has been reached, without ever rolling back.

And in all cases, there should never be a CRLF+SPACE continuation sequence in the middle of a UTF-8 sequence: that would break the initial UTF-8 condition assumed by this RFC, i.e. it would break conformance to that RFC.

If an implementation thinks that 75 is a number of bytes, it is wrong. Given the UTF-8 reference it could still use that limit, but it should not break in the middle of a UTF-8 sequence. It remains safe to break just after the sequence, even if the line (excluding the CRLF+SPACE continuation sequence) is then up to 78 bytes long. Decoders will still be able to parse it without breaking if they have the most common 512-byte input buffer.
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
But at the same time, that RFC makes a direct reference to UTF-8 as the default charset, so an implementation of the RFC cannot be agnostic about what UTF-8 is, and it will not break in the middle of a conforming UTF-8 sequence.

When the limit is reached, the implementation knows that it cannot cut at the position of a UTF-8 trailing byte, and it knows that it can safely roll back at most 3 bytes to locate a conforming UTF-8 leading byte and split the line **before** it (or any 7-bit ASCII byte, and split the line just **after** it). This requires very little buffering, and it is a fundamental property of UTF-8.

Other character sets (including /UTF-(16|32)([LB]E)?/ !!!) are not directly supported, except through external decoders that convert their input stream to UTF-8. Such a conversion has all the usual issues when it is not round-trip compatible or when the input does not conform to the specification of the input charset, but that is not the problem of this RFC: these decoders may also roll back internally, attempt to guess another charset, or use substitution, but they are supposed to generate conforming UTF-8 on output.

2017-07-24 17:01 GMT+02:00 Steffen Nurpmeso via Unicode:

> That is not what the RFC says. It says that simple
> implementations simply split lines when the limit is reached,
> which might be in the middle of an UTF-8 sequence. The RFC is
> thus improved compared to other RFCs in the email standard
> section, which do not give any hints on how to do that. Even
> RFC 2231, which avoids many of the ambiguities and problems of RFC
> 2047 (for a different purpose, but still), does not say it so
> exactly for the reversing character set conversion (which i for
> one perform _once_ after joining together the chunks, but is not
> a written word and, thus, ...).
>
> --steffen
Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
"Costello, Roger L. via Unicode" wrote:

 |Suppose an application splits a UTF-8 multi-octet sequence. The application
 |then sends the split sequence to a client. The client must restore
 |the original sequence.
 |
 |Question: is it possible to split a UTF-8 multi-octet sequence in such
 |a way that the client cannot unambiguously restore the original sequence?
 |
 |Here is the source of my question:
 |
 |The iCalendar specification [RFC 5545] says that long lines must be folded:
 |
 |    Long content lines SHOULD be split
 |    into a multiple line representations
 |    using a line "folding" technique.
 |    That is, a long line can be split between
 |    any two characters by inserting a CRLF
 |    immediately followed by a single linear
 |    white-space character (i.e., SPACE or HTAB).
 |
 |The RFC says that, when parsing a content line, folded lines must first
 |be unfolded using this technique:
 |
 |    Unfolding is accomplished by removing
 |    the CRLF and the linear white-space
 |    character that immediately follows.
 |
 |The RFC acknowledges that simple implementations might generate improperly
 |folded lines:
 |
 |    Note: It is possible for very simple
 |    implementations to generate improperly
 |    folded lines in the middle of a UTF-8
 |    multi-octet sequence. For this reason,
 |    implementations need to unfold lines
 |    in such a way to properly restore the
 |    original sequence.

That is not what the RFC says. It says that simple implementations simply split lines when the limit is reached, which might be in the middle of an UTF-8 sequence. The RFC is thus improved compared to other RFCs in the email standard section, which do not give any hints on how to do that. Even RFC 2231, which avoids many of the ambiguities and problems of RFC 2047 (for a different purpose, but still), does not say it so exactly for the reversing character set conversion (which i for one perform _once_ after joining together the chunks, but is not a written word and, thus, ...).

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)