Hello fellow hackers,

I'm very glad to announce libgrapheme[0], a library for handling
grapheme clusters. In short, a grapheme cluster is what Unicode
considers to be a single printed character. I gave a talk about the
topic and this library at slcon 2019[1], but you can also refer to
[2] and [3] for further reading.

As an example, consider the family emoji "👨‍👩‍👦". It is a single
grapheme cluster and should be printed as a single character in
conforming applications, but it is actually composed of the Unicode
code-points man ("👨"), woman ("👩") and boy ("👦") with
zero-width-joiners (U+200D) in between.
Each code-point is encoded as UTF-8 and thus consists of one or more
bytes, so to determine how long a grapheme cluster is, one has to
decode the UTF-8 and apply a set of rules given by Unicode. That is
exactly what libgrapheme does, except that it hides the middle layer
of code-points and only gives answers in byte-offsets.
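
To make the byte/code-point layers concrete, here is a minimal sketch
that counts the code-points in a UTF-8 string. It is not part of
libgrapheme; the name utf8_codepoints is made up for illustration, and
the function assumes valid UTF-8 input without checking it.

```c
#include <stddef.h>

/*
 * Count the code-points in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (bit pattern 10xxxxxx).
 * Hypothetical helper; assumes valid UTF-8 and does no validation.
 */
size_t
utf8_codepoints(const char *s)
{
        size_t n = 0;

        for (; *s != '\0'; s++) {
                if (((unsigned char)*s & 0xC0) != 0x80) {
                        n++;
                }
        }

        return n;
}
```

For the family emoji above this yields 5 code-points (three emoji and
two joiners) spread over 18 bytes, while the user sees one character.
Neither the byte count nor the code-point count gives that answer,
which is why the Unicode segmentation rules are needed on top.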

The above emoji example might seem irrelevant (I myself dislike
emojis), but this concept is also used in many other places,
including certain representations of umlauts.
For this reason, it is absolutely necessary to be able to handle
grapheme clusters to work with textual input consistently.
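
The umlaut case can be illustrated with a small sketch. Unicode allows
"ë" to be written either as the single code-point U+00EB or as a plain
"e" followed by the combining diaeresis U+0308. The macro and function
names below are made up for illustration and are not part of
libgrapheme.

```c
#include <string.h>

/* "ë" as one precomposed code-point, U+00EB (2 bytes in UTF-8) */
#define E_PRECOMPOSED "\xC3\xAB"
/* "ë" as "e" followed by U+0308, combining diaeresis (3 bytes) */
#define E_DECOMPOSED  "e\xCC\x88"

/* return 1 if both strings are byte-for-byte identical, else 0 */
int
same_bytes(const char *a, const char *b)
{
        return strcmp(a, b) == 0;
}
```

Here same_bytes(E_PRECOMPOSED, E_DECOMPOSED) returns 0 although both
strings show the same character, and their byte-lengths (2 and 3)
differ as well. Under the Unicode rules each of them is exactly one
grapheme cluster, so stepping through text cluster by cluster treats
both forms as one unit.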

Current solutions like ICU are very bloated, introduce dynamic loading
and are very hard to use. libgrapheme currently provides only one
function, grapheme_len(const char *), which returns the length (in
bytes) of the grapheme cluster beginning at the given char-pointer.
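
A function of this shape is enough to build other operations, for
instance counting the visible characters in a string. The following is
only a sketch: count_clusters and ascii_len are hypothetical names, and
the segmenter is passed in as a function pointer so the code compiles
without the library. I assume here that grapheme_len returns a size_t,
in which case it can be passed directly as the second argument.

```c
#include <stddef.h>

/*
 * Stand-in segmenter that treats every byte as its own cluster, which
 * is only correct for pure ASCII. Hypothetical helper for trying the
 * loop below without libgrapheme.
 */
size_t
ascii_len(const char *s)
{
        return (*s != '\0') ? 1 : 0;
}

/*
 * Count the clusters in s. cluster_len must return the byte-length of
 * the cluster starting at its argument (e.g. grapheme_len).
 */
size_t
count_clusters(const char *s, size_t (*cluster_len)(const char *))
{
        size_t len, n = 0;

        while (*s != '\0') {
                if ((len = cluster_len(s)) == 0) {
                        break; /* guard against an endless loop */
                }
                s += len;
                n++;
        }

        return n;
}
```

With the stand-in, count_clusters("Test", ascii_len) gives 4. Passing
grapheme_len instead would count grapheme clusters, so the family
emoji would contribute one to the total rather than 18.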

libgrapheme offers the following:

   * follows grapheme cluster rules according to the latest
     Unicode standard version 13.0
   * automatically downloads/generates lookup-tables from unicode.org
   * automatically downloads/generates/runs conformance-tests from
     unicode.org
   * fully static and merely 20kB compiled

This is not a release, just an initial public commit, but the code is
very stable. Feedback is greatly appreciated, especially input on the
API itself!

With best regards

Laslo Hunhold

[0]:https://git.suckless.org/libgrapheme/
[1]:https://dl.suckless.org/slcon/2019/slcon-2019-05-laslo_hunhold-reflections_on_unicode.webm
[2]:https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
[3]:https://unicode.org/reports/tr29/

PS: Here is a small example to get you started (compile with
$(CC) -o example example.c -lgrapheme). As you can see, it is possible
for a single "visible" character to be many bytes.

It couldn't be simpler to work with it. Try doing the same with ICU and
you'll see what I mean.

-----------------------------------------------------------------------
#include <grapheme.h>
#include <stdio.h>

int
main(void)
{
        char *s = "Tëst 👨‍👩‍👦 🇺🇸 नी நி!";
        size_t len;

        /* print each grapheme cluster with accompanying byte-length */
        for (; *s != '\0';) {
                len = grapheme_len(s);
                printf("%2zu bytes | %.*s\n", len, (int)len, s);
                s += len;
        }

        return 0;
}
-----------------------------------------------------------------------
OUTPUT:
 1 bytes | T
 2 bytes | ë
 1 bytes | s
 1 bytes | t
 1 bytes |  
18 bytes | 👨‍👩‍👦
 1 bytes |  
 8 bytes | 🇺🇸
 1 bytes |  
 6 bytes | नी
 1 bytes |  
 6 bytes | நி
 1 bytes | !
-----------------------------------------------------------------------
