Hi,

Pete Batard wrote:
> Or maybe there's a mathematical proof that
> a UTF-8 glyph byte encoding can never be larger than 1.5 the UTF-16 glyph
> byte encoding
I thought I had given one. Let me try again:

https://datatracker.ietf.org/doc/html/rfc3629
  "In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets."

The table after this statement shows that it can encode 21 bits that way.
The older FSS-UTF proposal of 1992 had up to 6 octets for up to 31 bits,
but was restricted in 2003 to 21 bits by the above RFC.
This is also defined in ISO/IEC 10646:2014 to ISO/IEC 10646:2020.

My proof is that UCS-2 encodes the Unicode code points U+0000 to U+FFFF in
2 bytes each, and UTF-8 encodes each of them in at most 3 bytes.

If the producer of the ISO uses UTF-16 instead of the older UCS-2, then the
input Unicode range is the same as with UTF-8: U+0000..U+10FFFF.
Characters which do not fit into 2 bytes (and thus possibly not into 3
UTF-8 bytes) get represented as 4 bytes in UTF-16 (a surrogate pair).
Given that UTF-8 cannot exceed 4 bytes, the number of bytes cannot grow for
these characters during conversion.
(My proposal would accommodate up to 6 UTF-8 bytes for 4 UTF-16 bytes and
thus even suffice for FSS-UTF.)

> So I'm going to stick to i_fname for length, with the expectation that we're
> unlikely to see realistic truncations outside of images designed to trigger
> one,

I try to obey specs and to avoid speculation about which of their
provisions might not occur in practice. In my experience this pays off in
the long run.

> I'm not
> sure I like the idea of trying to be too smart about or expecting specs not
> to change the deal.

My proposal with a name allocation of 3*i_fname/2 and a result size
parameter of _iso9660_recname_to_cstring() would be as safe against result
overflow as yours would be.
It would additionally guarantee that all valid UCS-2 names lead to valid
and untruncated UTF-8 names.

(One would separately have to check what the character conversion in
libcdio makes out of invalid UTF-16 byte sequences. Whatever the outcome,
the proposed size check would prevent memory corruption in
_iso9660_recname_to_cstring().)
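To make the counting argument concrete, here is a minimal sketch in C. It is
not libcdio code: utf8_bound(), utf8_len_of_utf16be() and check_bound() are
hypothetical helper names, and the sketch assumes big-endian UTF-16 input (as
Joliet uses) and that a lone surrogate would be emitted as a 3-byte sequence.
What the real character conversion does with invalid input would have to be
checked separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: worst-case UTF-8 byte count (without terminating
   NUL) for a name of i_fname bytes of UCS-2 or UTF-16. */
static size_t utf8_bound(size_t i_fname)
{
    return 3 * i_fname / 2;  /* every 2 input bytes yield at most 3 */
}

/* Hypothetical helper: number of UTF-8 bytes needed for n_bytes of
   big-endian UTF-16. A trailing odd byte is ignored. */
static size_t utf8_len_of_utf16be(const uint8_t *src, size_t n_bytes)
{
    size_t out = 0;

    for (size_t i = 0; i + 1 < n_bytes; i += 2) {
        uint16_t u = (uint16_t)((src[i] << 8) | src[i + 1]);

        if (u < 0x80) {
            out += 1;                     /* U+0000..U+007F */
        } else if (u < 0x800) {
            out += 2;                     /* U+0080..U+07FF */
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 3 < n_bytes
                   && (src[i + 2] & 0xFC) == 0xDC) {
            out += 4;                     /* surrogate pair: 4 bytes in, 4 out */
            i += 2;                       /* skip the low surrogate */
        } else {
            out += 3;                     /* rest of the BMP; assumption: a
                                             lone surrogate also costs 3 */
        }
    }
    return out;
}

/* Hypothetical self-check: the converted length never exceeds the bound. */
static void check_bound(const uint8_t *src, size_t n_bytes)
{
    assert(utf8_len_of_utf16be(src, n_bytes) <= utf8_bound(n_bytes));
}
```

The worst case is a name made only of code points from U+0800 to U+FFFF,
where each 2-byte unit becomes exactly 3 UTF-8 bytes. A surrogate pair stays
at 4 bytes and is therefore well below the 6 bytes that the bound reserves
for it.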
Have a nice day :)

Thomas