On 2021-12-11, Laslo Hunhold <[email protected]> wrote:
> So would you say that the only good way would be to only accept
> arrays of unsigned char in the API? I think this seems to be the
> logical conclusion.
That's one option, but another is to keep using arrays of char and
cast to unsigned char * before accessing. This is perfectly fine in
C: unsigned char is a character type, and you are allowed to access
the representation of any object through a pointer to character type,
regardless of the object's actual type. Accepting unsigned char * is
maybe a bit nicer for libgrapheme's implementation, but char * is
nicer for the users, since that's likely the type they already have.
It also allows them to continue to use string.h functions such as
strlen or strcmp on the same buffer (which are also defined to
interpret characters as unsigned char).

> When I read more I found out that C++ introduced static_cast and
> reinterpret_cast for this simple reason: Assuming some crazy
> signed-int-representation we just make up in our heads (some random
> permutation of 0..255 to -127..128), it is impossible to really know
> the intent of the user passing us a (signed) char-array. Let's say
> "0b01010101" means "0" in our crazy signed type, does the user intend
> to convey to us a null-byte (which is simply "encoded" in the signed
> type), or does he literally mean "0b01010101"? With static_cast and
> reinterpret_cast you can handle both cases separately.

I guess it depends on how that data was obtained in the first place.
Say you have char buf[1024], and read UTF-8 encoded data from a file
into it. fread is defined in terms of fgetc, which "obtains that
character as unsigned char" and stores it into an array of unsigned
char overlaying the object. In this case, accessing as unsigned char
is the intention. I can't really think of a case where the intention
would be to interpret as signed char and then convert to unsigned
char. With sign-magnitude, it'd be impossible to encode Ā (UTF-8 0xC4
0x80) this way, since there is no char value that results in 0x80 when
converted to unsigned char.
I know it's just a thought experiment, but note that there are only
three signed-integer representations valid in C: sign-magnitude, one's
complement, and two's complement. They differ only in the meaning of
the sign bit, which is the highest bit of the corresponding unsigned
integer type, so you couldn't go as crazy as the representation you
described.

> 1) Would you also go down the route of just demanding an array of
>    unsigned integers of at least 8 bits?

I'd suggest sticking with char *, but unsigned char * seems reasonable
as well.

> 2) Would you define it as "unsigned char *" or "uint_least8_t *"?
>    I'd almost favor the latter, given the entire library is already
>    using the stdint-types.

I don't think uint_least8_t is a good idea, since there is no
guarantee that it is a character type. The API user is unlikely to
have the data in a buffer of this type, so they'd potentially have to
allocate a new one and copy into it. With unsigned char *, they could
just cast if necessary.
