On Sat, 25 Apr 2020 23:03:56 +0200 Mattias Andrée <[email protected]> wrote:
Dear Mattias,

> cp_decode has been renamed to grapheme_decode

I personally don't like that change. It's already difficult enough to
understand the difference between codepoints and grapheme clusters, and
renaming the function like that complicates things further and might
cause misunderstandings.

This also contradicts the goal below of providing a solution for
non-UTF-8-text, which I honestly don't care about. There's no logical
reason to encode text into anything other than UTF-8, and the best
approach is to just re-encode everything into UTF-8. It's a tough call
to decide whether we want to turn libgrapheme into a general-purpose
UTF-8-library.

> and boundary has been renamed to grapheme_boundary.

I'm a bit conflicted about this change, though I would probably expose
it as

	grapheme_boundary(uint32_t, uint32_t, int *)

if I chose to include it. Sure, we do use Codepoint internally as a
typedef, but for a "public" API I prefer not to have typedefs if
possible.

The reason I'm conflicted about this change is that there's no guarantee
the grapheme-cluster-boundary algorithm won't be changed again. It
already has been in Unicode 12.0, which suddenly made it require a state
to be carried along, and there's no guarantee it won't get even crazier,
making it almost infeasible to expose more than a "gclen()"-function to
the user. The current implementation can store 32 states and uses 2 of
them for the algorithm. In this regard, we still have some headroom.

> The purpose of this is to allow faster text rendering
> where both individual code points and grapheme clusters
> boundaries are of interest, but it also (1) makes it
> easy to do online processing of large document (the user
> does not need to search for spaces, but only know an
> upper limit for how long encoding is needed to encode
> any codepoint) and (2) makes to library easy to use
> with non-UTF-8 text.

As I said above, I don't care about non-UTF-8-text, and anything
non-UTF-8 is either an internal representation (e.g. in Java with
UTF-16LE) or ancient. I do see how the stateful function might be useful
for byte-by-byte reading of a file, or something similar.

> This change also eliminates all unnamespaced, non-static
> functions that are not exposed to the user.

This is a very good point! I'll try to solve that in a separate coding
session on the existing code.

Thanks for your patch! I will have to think about this more. What do you
think about the following API-overview integrating the above changes?
This would also include some UTF-8 functionality.

	size_t grapheme_cp_decode(uint32_t *, char *, size_t)

		Decode the UTF-8 sequence in the array of given length
		into a codepoint, returning the number of bytes consumed
		or zero if there was an error (which also sets the
		codepoint to UTF_INVALID).

	size_t grapheme_cp_encode(uint32_t, char *, size_t)

		Encode the given codepoint into UTF-8 in the given array
		of given length. Return the number of bytes "used" or
		zero if something failed (codepoint out of bounds, array
		too small, etc.).

	size_t grapheme_len(const char *)

		Return the length (in bytes) of the grapheme cluster
		beginning at the provided char-address.

	int grapheme_boundary(uint32_t, uint32_t, int *state)

		Based on the current state (which is 0 at the beginning),
		determine whether there is a grapheme-cluster boundary
		between the two codepoints. If so, return 1, and 0
		otherwise.

What do you think?

With best regards

Laslo
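
To make the proposed grapheme_cp_decode() contract a bit more concrete, here is a minimal sketch of a decoder with the same shape: it returns the number of bytes consumed, or zero on error with the codepoint set to an invalid marker. This is only an illustration under assumptions, not libgrapheme code. The names sketch_cp_decode, sketch_count and SKETCH_INVALID are hypothetical, the value 0xFFFD for the invalid marker is a guess (the mail only names UTF_INVALID), and the input pointer is const here although the overview above writes plain char *.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for UTF_INVALID, whose value the mail does not give. */
#define SKETCH_INVALID UINT32_C(0xFFFD)

/*
 * Hypothetical stand-in for the proposed grapheme_cp_decode():
 * decode one codepoint from s[0..n), return the number of bytes
 * consumed, or 0 on error (in which case *cp is SKETCH_INVALID).
 */
static size_t
sketch_cp_decode(uint32_t *cp, const char *s, size_t n)
{
	/* smallest codepoint that legitimately needs a sequence of length i */
	static const uint32_t min[] = { 0, 0x00, 0x80, 0x800, 0x10000 };
	const unsigned char *u = (const unsigned char *)s;
	uint32_t c;
	size_t len, i;

	*cp = SKETCH_INVALID;
	if (n == 0)
		return 0;

	/* the lead byte determines the sequence length and the top bits */
	if (u[0] < 0x80) {
		len = 1;
		c = u[0];
	} else if ((u[0] & 0xE0) == 0xC0) {
		len = 2;
		c = u[0] & 0x1F;
	} else if ((u[0] & 0xF0) == 0xE0) {
		len = 3;
		c = u[0] & 0x0F;
	} else if ((u[0] & 0xF8) == 0xF0) {
		len = 4;
		c = u[0] & 0x07;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}

	if (len > n)
		return 0; /* sequence is cut off by the end of the array */

	/* each continuation byte has the form 10xxxxxx and adds 6 bits */
	for (i = 1; i < len; i++) {
		if ((u[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (u[i] & 0x3F);
	}

	/* reject overlong encodings, surrogates and out-of-range values */
	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;

	*cp = c;
	return len;
}

/*
 * Usage sketch: count the codepoints in s[0..n) by advancing by the
 * returned length; returns (size_t)-1 on malformed input.
 */
static size_t
sketch_count(const char *s, size_t n)
{
	uint32_t cp;
	size_t off, ret, count = 0;

	for (off = 0; off < n; off += ret, count++) {
		if ((ret = sketch_cp_decode(&cp, s + off, n - off)) == 0)
			return (size_t)-1;
	}

	return count;
}
```

Returning zero on error keeps the return type unsigned, but it means the caller decides how far to skip after a malformed sequence; the sketch above simply gives up.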
