On Sat, 9 May 2020 07:25:41 +0200 Laslo Hunhold <[email protected]> wrote:
> On Thu, 7 May 2020 18:32:23 +0200
> Mattias Andrée <[email protected]> wrote:
> 
> Dear Mattias,
> 
> > Perhaps, but I wouldn't expect anyone who doesn't understand the
> > difference to even use libgrapheme. But you would also state in
> > grapheme.h and the man pages that all functions except grapheme_len
> > are low-level functions.
> 
> that could work.
> 
> > Not a goal, but a positive side-effect of exposing the boundary
> > test function.
> 
> I agree with that, it has a positive side-effect.
> 
> > > The reason I'm conflicted with this change is that there's no
> > > guarantee the grapheme-cluster-boundary algorithm doesn't get
> > > changed again. It already has been in Unicode 12.0, which suddenly
> > > made it require a state to be carried with it, and there's no
> > > guarantee it won't get even crazier, making it almost infeasible
> > > to expose more than a "gclen()"-function to the user.
> > 
> > How about
> > 
> > typedef struct grapheme_state GRAPHEME_STATE;
> > 
> > /* Hidden from the user { */
> > struct grapheme_state {
> > 	uint32_t cp0;
> > 	int state;
> > };
> > /* } */
> > 
> > int grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *);
> > 
> > GRAPHEME_STATE *grapheme_create_state(void);
> > 
> > /* Just in case the state in the future
> >  * would require dynamic allocation */
> > void grapheme_free_state(GRAPHEME_STATE *);
> > 
> > grapheme_boundary() would reset the state each time a boundary is
> > found, so no reset function would be needed. It would also be
> > useful to avoid a new allocation if the grapheme cluster
> > identification process is aborted and started anew for a new text.
> > Since this would be very rare, no reset function is needed.
> > 
> > The only future I can see where this wouldn't be sufficient is if a
> > cluster break (or non-break) could be retroactively inserted where
> > the algorithm already stated that there was no break (or was a
> > break).
> > This would be so bizarre, I
> > cannot imagine this would ever be the case.
> 
> I don't like this change, because it destroys reentrancy, which is
> very important for multithreaded applications, and complicates things
> unnecessarily.

malloc(3) and free(3) are thread-safe, so there shouldn't be any
problem:

	GRAPHEME_STATE *state = grapheme_create_state();
	... = grapheme_boundary(..., state);
	grapheme_free_state(state);

> However, I think we should just risk it and assume that further
> versions of the Unicode grapheme-boundary algorithm will only rely
> on such a state.

I agree.

> > > [...]
> > > What do you think?
> > 
> > I don't see the point of including grapheme_cp_encode(); however,
> > I'm not opposed to making a larger UTF-8/Unicode library. Rather,
> > I think it would be nice to have one place for all my Unicode
> > needs, especially if I otherwise would have a handful of libraries
> > that all have their own UTF-8 decoding functions that all have to
> > be linked.
> 
> Yes, I agree with that. There are lots of bad and unsafe
> UTF-8 de-/encoders out there and the one in libgrapheme is actually
> pretty fast and safe (e.g. no overlong-encoded NUL, proper
> error-handling, etc.). It would be no bloat to expose it outside, as
> it runs "in the background" anyway. It's more of a debate on the
> "purity" of libgrapheme, but when including the boundary function,
> offering a way to read codepoints from a char-array makes a lot of
> sense.
> 
> With best regards
> 
> Laslo

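On the "safe UTF-8 decoder" point (rejecting overlong encodings such
as an overlong-encoded NUL), here is a minimal sketch of what such a
decoder has to check. This is NOT libgrapheme's actual decoder; the
function name utf8_decode and the return-bytes-consumed convention
are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFDu

/*
 * Decode one codepoint from buf[0..len), writing it (or UTF8_INVALID
 * on error) to *cp and returning the number of bytes consumed.
 */
size_t
utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
	static const struct {
		unsigned char mask, value; /* lead-byte bit pattern */
		uint32_t min;              /* smallest legal codepoint */
		size_t nbytes;
	} lead[] = {
		{ 0x80, 0x00, 0x000000, 1 }, /* 0xxxxxxx */
		{ 0xE0, 0xC0, 0x000080, 2 }, /* 110xxxxx */
		{ 0xF0, 0xE0, 0x000800, 3 }, /* 1110xxxx */
		{ 0xF8, 0xF0, 0x010000, 4 }, /* 11110xxx */
	};
	size_t i, j, n;
	uint32_t v;

	*cp = UTF8_INVALID;
	if (len == 0)
		return 0;
	for (i = 0; i < 4; i++)
		if ((buf[0] & lead[i].mask) == lead[i].value)
			break;
	if (i == 4)
		return 1; /* stray continuation or invalid lead byte */
	n = lead[i].nbytes;
	if (n > len)
		return len; /* truncated sequence */
	v = buf[0] & ~lead[i].mask;
	for (j = 1; j < n; j++) {
		if ((buf[j] & 0xC0) != 0x80)
			return j; /* malformed continuation byte */
		v = (v << 6) | (buf[j] & 0x3F);
	}
	/* reject overlong forms, surrogates and out-of-range values */
	if (v < lead[i].min || v > 0x10FFFF ||
	    (v >= 0xD800 && v <= 0xDFFF))
		return n;
	*cp = v;
	return n;
}
```

With this, the classic overlong NUL (0xC0 0x80) decodes to
UTF8_INVALID instead of U+0000, which is the kind of safety property
the mail refers to.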