On 2021-12-16, Laslo Hunhold <[email protected]> wrote:
> However, the case
> I'm making is that we can assume that
>
> 1) uint8_t exists
> 2) uint8_t == unsigned char

I think assumption 1 is valid, but not necessarily 2.

> This may not be directly specified in the standard, but follows from
> the following observations:
>
> 1) We make use of POSIX-functions in the code, so compiling
>    libgrapheme requires a POSIX-compliant compiler and stdlib. POSIX
>    requires CHAR_BIT == 8, which means that we can assume that chars
>    are 8 bit, and thus uint8_t exists.
> 2) C99 specifies char to be of at least 8 bit size. Given char is meant
>    to be the smallest addressable unit and uint8_t exists, char is
>    exactly 8 bits.

Both of these observations are true, but just because uint8_t is 8-bit
and unsigned char is 8-bit doesn't mean that uint8_t == unsigned char.
A C implementation can have implementation-defined extended integer
types, so it is possible that it defines uint8_t as an 8-bit extended
integer type, distinct from unsigned char (similar to how long and
long long are distinct types even when both are 64 bits wide). As far
as I know, this would still be POSIX compliant.

> However, here you have a problem when suddenly char is 16 bits (might
> be according to the standard). Because then you read in two
> UTF-8-code-units at once, but lg_utf8_decode silently discards half of
> the data in the high bits.
> But this wouldn't even happen, given POSIX mandates char to be 8 bits,
> and given even C99 mandates char to be of integral type, you only have
> one unique way to specify an unsigned integer of certain bit-length,
> given C99 also mandates that char shouldn't have any padding.

Ah, okay, I see what you mean. To be honest I'm not really sure how
something like file encoding and I/O would work on such a system, but
I was assuming that files would contain one code unit per byte, rather
than packing multiple code units into a single byte. For instance, on
a hypothetical system with 9-bit bytes, I wouldn't expect a code unit
to cross the byte boundary.

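To illustrate the point about same width versus same type, here is a
rough sketch. It assumes a C11 compiler for _Generic (so it would not
build as strict C99), and HAS_TYPE and the two function names are made
up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Evaluates to 1 if expression x has exactly type T, 0 otherwise. */
#define HAS_TYPE(x, T) _Generic((x), T: 1, default: 0)

/*
 * long and long long are distinct types on every implementation,
 * even where both happen to be 64 bits wide, so this returns 0.
 */
static int
long_is_long_long(void)
{
	return HAS_TYPE((long)0, long long);
}

/*
 * Implementation-defined: returns 1 where uint8_t is a typedef for
 * unsigned char, and 0 where it is some other 8-bit type (for
 * example an extended integer type).
 */
static int
uint8_is_unsigned_char(void)
{
	return HAS_TYPE((uint8_t)0, unsigned char);
}
```

On the common implementations I am aware of, uint8_is_unsigned_char()
returns 1, but as far as I can tell nothing requires that, whereas
sizeof(uint8_t) == sizeof(unsigned char) holds wherever uint8_t exists.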
> So the case can be made that uint8_t == unsigned char, and casting
> between char and unsigned char is fine, so you just cast any char * to
> uint8_t * which will work as you would otherwise not have been able to
> even compile libgrapheme in the first place.
>
> Or am I missing something here except from the standard semantically
> making a difference? Is there any technical possibility to have a
> system that has CHAR_BIT == 8 where uint8_t != unsigned char?

Yes, I believe this is a possibility.

If you are assuming that unsigned char == uint8_t, I think you should
just use unsigned char in your API. You could document the API as
expecting one UTF-8 code unit per byte if you are worried about
confusion regarding CHAR_BIT.
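
Something along these lines is what I have in mind. This is only a
sketch: count_code_points() and count_code_points_str() are
hypothetical names, not part of libgrapheme's actual API.

```c
#include <stddef.h>

/*
 * Hypothetical example, not libgrapheme's API.
 *
 * s points to n bytes, each holding exactly one UTF-8 code unit (if
 * CHAR_BIT > 8, the bits above the low 8 are expected to be zero).
 * Returns the number of bytes that are not continuation bytes
 * (10xxxxxx), which for well-formed input is the number of code
 * points.
 */
static size_t
count_code_points(const unsigned char *s, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80) {
			count++;
		}
	}

	return count;
}

/* Callers holding char data can pass it on without involving uint8_t. */
static size_t
count_code_points_str(const char *str, size_t n)
{
	return count_code_points((const unsigned char *)str, n);
}
```

With this, a caller passes the 5 bytes of "a\xE2\x82\xACz" ('a', the
three code units of U+20AC, 'z') and gets back 3.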
