On Sat, 9 May 2020 07:25:41 +0200 Laslo Hunhold <[email protected]> wrote:
> On Thu, 7 May 2020 18:32:23 +0200
> Mattias Andrée <[email protected]> wrote:
> 
> Dear Mattias,
> 
> > Perhaps, but I wouldn't expect anyone who doesn't understand the
> > difference to even use libgrapheme. But you would also state in
> > grapheme.h and the man pages that all functions except grapheme_len
> > are low-level functions.
> 
> that could work.
> 
> > Not a goal, but a positive side-effect of exposing the boundary
> > test function.
> 
> I agree with that, it has a positive side-effect.
> 
> > > The reason I'm conflicted with this change is that there's no
> > > guarantee the grapheme-cluster-boundary algorithm doesn't get
> > > changed again. It already has been in Unicode 12.0, which suddenly
> > > made it require a state to be carried with it, and there's no
> > > guarantee it won't get even crazier, making it almost infeasible
> > > to expose more than a "gclen()"-function to the user.
> > 
> > How about
> > 
> > typedef struct grapheme_state GRAPHEME_STATE;
> > 
> > /* Hidden from the user { */
> > struct grapheme_state {
> > 	uint32_t cp0;
> > 	int state;
> > };
> > /* } */
> > 
> > int grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *);
> > 
> > GRAPHEME_STATE *grapheme_create_state(void);
> > 
> > /* Just in case the state in the future
> >  * would require dynamic allocation */
> > void grapheme_free_state(GRAPHEME_STATE *);
> > 
> > grapheme_boundary() would reset the state each time a boundary is
> > found, so no reset function would be needed. It would also be
> > useful to avoid a new allocation if the grapheme cluster
> > identification process is aborted and started anew for a new text.
> > Since this would be very rare, no reset function is needed.
> > 
> > The only future I can see where this wouldn't be sufficient is if a
> > cluster break (or non-break) could be retroactively inserted where
> > the algorithm already stated that there was no break (or was a
> > break).
> > This would be so bizarre, I
> > cannot imagine this would ever be the case.
> 
> I don't like this change, because it destroys reentrancy, which is
> very important for multithreaded applications, and complicates things
> unnecessarily.

malloc(3) and free(3) are thread-safe, so there shouldn't be any
problem:

	GRAPHEME_STATE *state = grapheme_create_state();
	... = grapheme_boundary(..., state);
	grapheme_free_state(state);

> However, I think we should just risk it and assume that further
> versions of the Unicode grapheme-boundary algorithm will only rely
> on such a state.

I agree.

> > > [...]
> > > What do you think?
> > 
> > I don't see the point of including grapheme_cp_encode(); however,
> > I'm not opposed to making a larger UTF-8/Unicode library. Rather,
> > I think it would be nice to have one place for all my Unicode
> > needs, especially if I otherwise would have a handful of libraries
> > that all have their own UTF-8 decoding functions that all have to
> > be linked.
> 
> Yes, I agree with that. There are lots of bad and unsafe
> UTF-8 de-/encoders out there and the one in libgrapheme is actually
> pretty fast and safe (e.g. no overlong-encoded NUL, proper
> error-handling, etc.). It would be no bloat to expose it outside, as
> it runs "in the background" anyway. It's more of a debate on the
> "purity" of libgrapheme, but when including the boundary function,
> offering a way to read codepoints from a char-array makes a lot of
> sense.
> 
> With best regards
> 
> Laslo

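On the "safe UTF-8 decoder" point (rejecting overlong encodings such
as an overlong-encoded NUL), here is a minimal sketch of what such a
decoder has to check. This is NOT libgrapheme's actual decoder; the
function name utf8_decode and the return-bytes-consumed convention
are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFDu

/*
 * Decode one codepoint from buf[0..len), writing it (or UTF8_INVALID
 * on error) to *cp and returning the number of bytes consumed.
 */
size_t
utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
	static const struct {
		unsigned char mask, value; /* lead-byte bit pattern */
		uint32_t min;              /* smallest legal codepoint */
		size_t nbytes;
	} lead[] = {
		{ 0x80, 0x00, 0x000000, 1 }, /* 0xxxxxxx */
		{ 0xE0, 0xC0, 0x000080, 2 }, /* 110xxxxx */
		{ 0xF0, 0xE0, 0x000800, 3 }, /* 1110xxxx */
		{ 0xF8, 0xF0, 0x010000, 4 }, /* 11110xxx */
	};
	size_t i, j, n;
	uint32_t v;

	*cp = UTF8_INVALID;
	if (len == 0)
		return 0;
	for (i = 0; i < 4; i++)
		if ((buf[0] & lead[i].mask) == lead[i].value)
			break;
	if (i == 4)
		return 1; /* stray continuation or invalid lead byte */
	n = lead[i].nbytes;
	if (n > len)
		return len; /* truncated sequence */
	v = buf[0] & ~lead[i].mask;
	for (j = 1; j < n; j++) {
		if ((buf[j] & 0xC0) != 0x80)
			return j; /* malformed continuation byte */
		v = (v << 6) | (buf[j] & 0x3F);
	}
	/* reject overlong forms, surrogates and out-of-range values */
	if (v < lead[i].min || v > 0x10FFFF ||
	    (v >= 0xD800 && v <= 0xDFFF))
		return n;
	*cp = v;
	return n;
}
```

With this, the classic overlong NUL (0xC0 0x80) decodes to
UTF8_INVALID instead of U+0000, which is the kind of safety property
the mail refers to.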