On Wed, 15 Dec 2021 12:24:21 -0800 Michael Forney <[email protected]> wrote:
Dear Michael,

> I think this is a mistake. It makes it very difficult to use the API
> correctly if you have data in an array of char or unsigned char, which
> is usually the case.
> Here's an example of some real code that has a char * buffer:
> https://git.sr.ht/~exec64/imv/tree/a83304d4d673aae6efed51da1986bd7315a4d642/item/src/console.c#L54-58
>
> How would you suggest that this code be written for the new API? The
> only thing I can think is
>
> 	if (buffer[position] != 0) {
> 		size_t bufferlen = strlen(buffer) + 1 - position;
> 		uint8_t *newbuffer = malloc(bufferlen);
> 		if (!newbuffer) ...
> 		memcpy(newbuffer, buffer + position, bufferlen);
> 		position += grapheme_bytelen(newbuffer);
> 		free(newbuffer);
> 	}
> 	return position;
>
> This sort of thing would turn me off of using the library entirely.

yeah, it would be insane to malloc() a new buffer. However, the case I'm
making is that we can assume that

 1) uint8_t exists
 2) uint8_t == unsigned char

This may not be directly specified in the standard, but follows from
the following observations:

 1) We make use of POSIX-functions in the code, so compiling
    libgrapheme requires a POSIX-compliant compiler and stdlib. POSIX
    requires CHAR_BIT == 8, which means that we can assume that chars
    are 8 bit, and thus uint8_t exists.
 2) C99 specifies char to be of at least 8 bit size. Given char is
    meant to be the smallest addressable unit and uint8_t exists,
    char is exactly 8 bits.

> > Any other way would have introduced too many implicit assumptions.
>
> Like what?

I was unclear there. What I actually meant was that "char" carries
implicit assumptions in the programming world that are actually not
even reflected in the standard. When specifying the UTF-8-array as
char *, you basically carry on this tradition instead of being specific
with what you actually want.

> If you really want your code to break when CHAR_BIT != 8, you could
> use a static assert (there are also ways to emulate this in C99).
> But even if CHAR_BIT > 8, unsigned char is perfectly capable to
> represent all the values used in UTF-8 encoding, so I don't see the
> problem.

Let's take a simple example: Say you have a file in UTF-8 encoding of
known size and wanted to read it and simply print the code points. You
would probably do it as follows in C (no checks to get the point
across), and let's assume here that lg_utf8_* accepts char *:

	FILE *fp;
	size_t size, off, ret, i;
	char *data;
	uint_least32_t cp;

	/* open */
	fp = fopen("file.txt", "r");

	/* get file size and allocate buffer */
	fseek(fp, 0L, SEEK_END);
	size = ftell(fp);
	rewind(fp);
	data = malloc(size);

	/* fill buffer, never requesting more than the space left */
	for (off = 0; (ret = fread(data + off, 1, size - off, fp)) > 0;
	     off += ret)
		;

	/* print code points; the buffer is not NUL-terminated,
	 * so stop at size */
	for (i = 0; i < size; i += ret) {
		ret = lg_utf8_decode(data + i, size - i, &cp);
		printf("code point: %"PRIuLEAST32"\n", cp);
	}

However, here you have a problem when char is suddenly 16 bits (which
the standard permits), because then you read in two UTF-8 code units at
once, but lg_utf8_decode silently discards the half of the data in the
high bits.

But this wouldn't even happen, given POSIX mandates char to be 8 bits.
Given C99 mandates char to be of integral type, and given C99 also
mandates that unsigned char has no padding bits, there is only one
unique way to specify an unsigned integer of a certain bit-length. So
the case can be made that uint8_t == unsigned char, and casting between
char and unsigned char is fine, so you just cast any char * to
uint8_t *, which will work, as you would otherwise not have been able
to even compile libgrapheme in the first place.

Or am I missing something here, apart from the standard making a
semantic distinction? Is there any technical possibility to have a
system that has CHAR_BIT == 8 where uint8_t != unsigned char?
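To make the cast concrete, here is a minimal sketch under the
assumptions argued for above (CHAR_BIT == 8, uint8_t == unsigned char).
Note that seq_len() and next_offset() are hypothetical helpers and not
libgrapheme functions; seq_len() merely stands in for any decoder that
takes a uint8_t *, and the #error check is one possible way to refuse
compilation elsewhere:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Refuse to compile where the cast below would be questionable. */
#if CHAR_BIT != 8
#error "this sketch assumes CHAR_BIT == 8"
#endif

/*
 * Hypothetical stand-in for a decoder taking uint8_t * (NOT part of
 * libgrapheme): returns the sequence length implied by the lead byte,
 * or 1 for a stray continuation byte or an invalid lead byte.
 */
static size_t
seq_len(const uint8_t *s, size_t n)
{
	assert(s != NULL);

	if (n == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;
	if ((s[0] & 0xE0) == 0xC0)
		return 2;
	if ((s[0] & 0xF0) == 0xE0)
		return 3;
	if ((s[0] & 0xF8) == 0xF0)
		return 4;
	return 1;
}

/* Caller side: the data lives in a plain char buffer, no copy needed. */
static size_t
next_offset(const char *buffer, size_t len, size_t position)
{
	size_t next;

	if (position >= len)
		return position;

	/* the cast in question: char * to uint8_t * */
	next = position + seq_len((const uint8_t *)buffer + position,
	                          len - position);

	/* clamp in case the sequence is truncated at the buffer end */
	return next > len ? len : next;
}
```

With something like this, the caller keeps its char * buffer and
advances with position = next_offset(buffer, len, position), without
any malloc() or memcpy().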
> > And even if all fails and there simply is no 8-bit-type, one can
> > always use the lg_grapheme_isbreak()-function and roll his own
> > de/encoding.
>
> I'm still confused as to what you mean by rolling your own
> de/encoding. What would that look like?
>
> If there is no 8-bit type, libgrapheme could not be compiled or used
> at all since uint8_t would be missing.

Yeah, it was a bit of a transitive argument given you would have to
tailor grapheme and remove the utf8-encoder/decoder. But then you could
simply use the lg_grapheme_isbreak()-function which works on code
points. How you obtain the code points is up to the user, but then
libgrapheme doesn't care and simply returns a "decision".

tl;dr: I don't see what's wrong with simply casting char * to uint8_t *
given it's reasonable to assume that uint8_t == unsigned char for the
aforementioned reasons.

With best regards

Laslo
