Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

2021-12-15 Thread Michael Forney
On 2021-12-15, Laslo Hunhold wrote:
> thanks for clearing that up! After more thought I made the decision to
> go with uint8_t, though. I see the point regarding character types, but
> this notion is more of a smelly foot in the C standard. We are moving
> towards UTF-8 as _the_ default encoding format, so considering
> character strings as such is justified.

I think this is a mistake. It makes it very difficult to use the API
correctly if you have data in an array of char or unsigned char, which
is usually the case.

Here's an example of some real code that has a char * buffer:
https://git.sr.ht/~exec64/imv/tree/a83304d4d673aae6efed51da1986bd7315a4d642/item/src/console.c#L54-58

How would you suggest that this code be written for the new API? The
only thing I can think of is

if (buffer[position] != 0) {
  size_t bufferlen = strlen(buffer) + 1 - position;
  uint8_t *newbuffer = malloc(bufferlen);
  if (!newbuffer) ...
  memcpy(newbuffer, buffer + position, bufferlen);
  position += grapheme_bytelen(newbuffer);
  free(newbuffer);
}
return position;

This sort of thing would turn me off of using the library entirely.

> Any other way would have introduced too many implicit assumptions.

Like what?

If you really want your code to break when CHAR_BIT != 8, you could
use a static assert (there are also ways to emulate this in C99). But
even if CHAR_BIT > 8, unsigned char is perfectly capable of representing
all the values used in UTF-8 encoding, so I don't see the problem.
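
As an illustration of the static-assert idea above, here is a minimal
sketch. The typedef and function names are hypothetical and not part of
libgrapheme; the first check uses the C11 keyword, the second is one
common C99 emulation based on a negative array size.

```c
#include <assert.h>
#include <limits.h>

/* C11: compilation stops with this message if bytes are not 8 bits wide. */
_Static_assert(CHAR_BIT == 8, "this code assumes CHAR_BIT == 8");

/*
 * C99 emulation: an array type with a negative size is a constraint
 * violation, so this typedef only compiles when the condition holds.
 * The typedef name is hypothetical.
 */
typedef char assert_char_bit_is_8[(CHAR_BIT == 8) ? 1 : -1];

/* Hypothetical helper so the same condition can be observed at run time. */
static int
char_bit_is_8(void)
{
	assert(CHAR_BIT >= 8); /* guaranteed by the C standard */
	return CHAR_BIT == 8;
}
```

Either form turns a violated assumption into a build failure instead of
a silent misbehaviour, without changing the types used in the interface.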

> And even if all fails and there simply is no 8-bit-type, one can always
> use the lg_grapheme_isbreak()-function and roll his own de/encoding.

I'm still confused as to what you mean by rolling your own
de/encoding. What would that look like?

If there is no 8-bit type, libgrapheme could not be compiled or used
at all since uint8_t would be missing.



Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold

2021-12-15 Thread Laslo Hunhold
On Sun, 12 Dec 2021 12:41:15 -0800
Michael Forney wrote:

Dear Michael,

> > But char and unsigned char are of integer type, aren't they?  
> 
> They are integer types and character types. Character types are a
> subset of integer types: char, signed char, and unsigned char.
> 
> > So on a
> > POSIX-system, which is 99.999% of cases, it makes no difference if
> > we cast between (char *) and (unsigned char *) (as you suggested
> > above if we went with unsigned char * for the interfaces) and
> > between (char *) and (uint_least8_t *), does it? So if the end-user
> > has to cast anyway, then he can just cast to an uint* type as well.
> >  
> 
> The difference is that uint8_t and uint_least8_t are not necessarily
> character types. Although the existence of uint8_t implies that
> unsigned char has exactly 8 bits, uint8_t could be a separate 8-bit
> integer type distinct from the character types. If this were the case,
> accessing an array of unsigned char through a pointer to uint8_t would
> be undefined behavior (C99 6.5p7).
> 
> Here are some examples:
> 
> char a[1] = {0};
> // always valid, evaluates to 0
> *(unsigned char *)a;
> // always valid, sets the bits of a[0] to all ones (0xff),
> // but the value of a[0] depends on the signed-integer representation
> *(unsigned char *)a = 0xff;
> // undefined behavior if uint8_t is not a character type
> *(uint8_t *)a;
> *(uint8_t *)a = 0xff;
> 
> uint8_t b[1] = {0};
> // always valid, evaluates to 0
> *(unsigned char *)b;
> // always valid, sets the bits of b[0] to all ones (0xff)
> *(unsigned char *)b = 0xff;
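
To make the "always valid" case concrete, the following is a minimal
sketch of a function that accepts a plain char buffer and inspects it
through a pointer to unsigned char. The function name is hypothetical
and not part of libgrapheme; it only classifies the lead byte of a UTF-8
sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the UTF-8 sequence introduced
 * by the lead byte s[0], or 0 if s[0] cannot start a sequence.
 * Accessing any object through unsigned char * is permitted, so no
 * copy of the caller's char buffer is needed.
 */
static size_t
utf8_lead_len(const char *s)
{
	const unsigned char *u = (const unsigned char *)s;

	assert(s != NULL);

	if (u[0] < 0x80)
		return 1; /* 0xxxxxxx: ASCII */
	if (u[0] >= 0xC2 && u[0] <= 0xDF)
		return 2; /* 110xxxxx, excluding the overlong 0xC0 and 0xC1 */
	if (u[0] >= 0xE0 && u[0] <= 0xEF)
		return 3; /* 1110xxxx */
	if (u[0] >= 0xF0 && u[0] <= 0xF4)
		return 4; /* 11110xxx, up to U+10FFFF */
	return 0; /* continuation byte or invalid lead byte */
}
```

The comparisons operate on unsigned values, so the result does not
depend on whether plain char is signed on the target platform.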

thanks for clearing that up! After more thought I made the decision to
go with uint8_t, though. I see the point regarding character types, but
this notion is more of a smelly foot in the C standard. We are moving
towards UTF-8 as _the_ default encoding format, so considering
character strings as such is justified.
Any other way would have introduced too many implicit assumptions.

> > Even more drastically, given UTF-8 is an encoding, I don't really
> > feel good about not being strict about the returned arrays in such
> > a way that it becomes possible to have an array of e.g. 16-bit
> > integers where only the bottom half is used and it becomes the
> > user's job to then hand-craft it into a proper array to send over
> > the network, etc. Surely one can hack around this as a library
> > user, but at a certain point I think "to hell with it" and just be
> > strict about it in the API. C already has a weak type system and I
> > don't want to further weaken it by supporting decades-old implicit
> > assumptions on types. So in a way, maybe uint8_t is the way to go,
> > and then the library user immediately knows it's not going to work
> > with his machine because uint8_t is not defined for him.  
> 
> Not quite sure what you mean here. Are you talking about the case
> where CHAR_BIT is 16? In that case, there'd be no uint8_t, so you
> couldn't "hand-craft it into a proper array". I'm not sure how
> networking APIs would work on such a system, but maybe they'd consider
> only the lowest 8 bits of each byte.

Yes, exactly. Trying to include grapheme.h would immediately show that
the system is incompatible rather than silently "breaking" in this
respect. Given how smart compilers have become at working with "halves" of
registers, I'd much rather expect the CPU to offer instructions to work
with 8-bit-integers as "halves" of 16 bits (accessing lower and upper).

And even if all fails and there simply is no 8-bit-type, one can always
use the lg_grapheme_isbreak()-function and roll his own de/encoding.
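
For illustration, "rolling one's own decoding" could look roughly like
the sketch below. It is an assumption about what is meant, not code from
libgrapheme: the function name, the INVALID_CP macro and the error
policy (emit U+FFFD and consume one byte) are all hypothetical. The
decoded code points could then be passed pairwise to
lg_grapheme_isbreak(), which is deliberately not called here so that the
sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_CP UINT32_C(0xFFFD) /* U+FFFD REPLACEMENT CHARACTER */

/*
 * Hypothetical decoder: reads one code point from the first len bytes
 * of s, stores it in *cp and returns the number of bytes consumed
 * (0 only if len is 0). Malformed input yields U+FFFD and consumes
 * a single byte.
 */
static size_t
decode_utf8(const char *s, size_t len, uint_least32_t *cp)
{
	const unsigned char *u = (const unsigned char *)s;
	uint_least32_t min;
	size_t n, i;

	assert(s != NULL && cp != NULL);
	if (len == 0)
		return 0;

	if (u[0] < 0x80) {
		*cp = u[0];
		return 1;
	} else if ((u[0] & 0xE0) == 0xC0) {
		n = 2; min = 0x80; *cp = u[0] & 0x1F;
	} else if ((u[0] & 0xF0) == 0xE0) {
		n = 3; min = 0x800; *cp = u[0] & 0x0F;
	} else if ((u[0] & 0xF8) == 0xF0) {
		n = 4; min = 0x10000; *cp = u[0] & 0x07;
	} else {
		*cp = INVALID_CP; /* stray continuation or invalid lead byte */
		return 1;
	}

	/* accumulate the payload bits of each continuation byte */
	for (i = 1; i < n; i++) {
		if (i >= len || (u[i] & 0xC0) != 0x80) {
			*cp = INVALID_CP;
			return 1;
		}
		*cp = (*cp << 6) | (u[i] & 0x3F);
	}

	/* reject overlong encodings, surrogates and values above U+10FFFF */
	if (*cp < min || (*cp >= 0xD800 && *cp <= 0xDFFF) || *cp > 0x10FFFF) {
		*cp = INVALID_CP;
		return 1;
	}
	return n;
}
```

Note that the sketch takes a const char * and reads it through unsigned
char, so it works on ordinary string buffers without a cast to uint8_t.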

With best regards

Laslo



Re: [hackers] [libgrapheme] Refactor Makefile, add dist-target and add test-util || Laslo Hunhold

2021-12-15 Thread Laslo Hunhold
On Wed, 15 Dec 2021 13:28:04 +0100
Quentin Rameau wrote:

Dear Quentin,

> > -GEN = gen/grapheme gen/grapheme-test
> > -LIB = src/grapheme src/utf8 src/util
> > -TEST = test/grapheme test/grapheme-performance test/utf8-decode test/utf8-encode
> > -
> > -MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
> > +GEN =\
> > +   gen/grapheme\
> > +   gen/grapheme-test
> > +SRC =\
> > +   src/grapheme\
> > +   src/utf8\
> > +   src/util
> > +TEST =\
> > +   test/grapheme\
> > +   test/grapheme-performance\
> > +   test/utf8-decode\
> > +   test/utf8-encode
> > +MAN3 =\
> > +   man/lg_grapheme_isbreak.3\
> > +   man/lg_grapheme_nextbreak.3
> >  MAN7 = man/libgrapheme.7
> >  
> >  all: libgrapheme.a libgrapheme.so  
> 
> The idiomatic way of using those is to escape the newline on every
> macro line.
> The goal here is to help producing less noise in patches which add or
> remove lines there, so that only the actual concerned lines are
> modified, not the one that may be the last because you now need to add
> or remove a '\' there.

thanks for this! I have now pushed a commit that adopts this good idiom.

With best regards

Laslo



[hackers] [libgrapheme] Make lists in Makefile more idiomatic and avoid breaks || Laslo Hunhold

2021-12-15 Thread git
commit 20c105bcdd1c54401d4d23cdb9ded56ee7a2ffd4
Author: Laslo Hunhold 
AuthorDate: Wed Dec 15 13:34:06 2021 +0100
Commit: Laslo Hunhold 
CommitDate: Wed Dec 15 13:34:06 2021 +0100

Make lists in Makefile more idiomatic and avoid breaks

Thanks Quentin Rameau for the remark regarding the more idiomatic
way to specify lists where also the last element has an explicit
line-break followed by an empty line.

Also avoid breaks later on in the code: This breaks the line-length
a bit, but has an effect on the output. I prefer a cleaner output
(also in the build-logs) over one overlong line.

Signed-off-by: Laslo Hunhold 

diff --git a/Makefile b/Makefile
index fb8969a..104d805 100644
--- a/Makefile
+++ b/Makefile
@@ -7,22 +7,27 @@ include config.mk
 DATA =\
data/emoji-data.txt\
data/GraphemeBreakProperty.txt\
-   data/GraphemeBreakTest.txt
+   data/GraphemeBreakTest.txt\
+
 GEN =\
gen/grapheme\
-   gen/grapheme-test
+   gen/grapheme-test\
+
 SRC =\
src/grapheme\
src/utf8\
-   src/util
+   src/util\
+
 TEST =\
test/grapheme\
test/grapheme-performance\
test/utf8-decode\
-   test/utf8-encode
+   test/utf8-encode\
+
 MAN3 =\
man/lg_grapheme_isbreak.3\
-   man/lg_grapheme_nextbreak.3
+   man/lg_grapheme_nextbreak.3\
+
 MAN7 = man/libgrapheme.7
 
 all: libgrapheme.a libgrapheme.so
@@ -99,16 +104,14 @@ uninstall:
rm -f "$(DESTDIR)$(INCPREFIX)/grapheme.h"
 
 clean:
-   rm -f $(GEN:=.h) $(GEN:=.o) $(GEN) gen/util.o $(SRC:=.o) src/util.o \
-   $(TEST:=.o) test/util.o $(TEST) libgrapheme.a libgrapheme.so
+   rm -f $(GEN:=.h) $(GEN:=.o) gen/util.o $(GEN) $(SRC:=.o) src/util.o $(TEST:=.o) test/util.o $(TEST) libgrapheme.a libgrapheme.so
 
 clean-data:
rm -f $(DATA)
 
 dist:
-   mkdir libgrapheme-$(VERSION) libgrapheme-$(VERSION)/data\
-   libgrapheme-$(VERSION)/gen libgrapheme-$(VERSION)/man\
-   libgrapheme-$(VERSION)/src libgrapheme-$(VERSION)/test
+   mkdir libgrapheme-$(VERSION)
+   for m in data gen man src test; do mkdir libgrapheme-$(VERSION)/$$m; done
cp config.mk grapheme.h LICENSE Makefile libgrapheme-$(VERSION)
cp $(DATA) libgrapheme-$(VERSION)/data
cp $(GEN:=.c) gen/util.c gen/util.h libgrapheme-$(VERSION)/gen



Re: [hackers] [libgrapheme] Refactor Makefile, add dist-target and add test-util || Laslo Hunhold

2021-12-15 Thread Quentin Rameau
Hi Laslo,

As a note,

> -GEN = gen/grapheme gen/grapheme-test
> -LIB = src/grapheme src/utf8 src/util
> -TEST = test/grapheme test/grapheme-performance test/utf8-decode test/utf8-encode
> -
> -MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
> +GEN =\
> + gen/grapheme\
> + gen/grapheme-test
> +SRC =\
> + src/grapheme\
> + src/utf8\
> + src/util
> +TEST =\
> + test/grapheme\
> + test/grapheme-performance\
> + test/utf8-decode\
> + test/utf8-encode
> +MAN3 =\
> + man/lg_grapheme_isbreak.3\
> + man/lg_grapheme_nextbreak.3
>  MAN7 = man/libgrapheme.7
>  
>  all: libgrapheme.a libgrapheme.so

The idiomatic way of using those is to escape the newline on every macro
line.
The goal here is to help producing less noise in patches which add or
remove lines there, so that only the actual concerned lines are
modified, not the one that may be the last because you now need to add
or remove a '\' there.



[hackers] [libgrapheme] Refactor Makefile, add dist-target and add test-util || Laslo Hunhold

2021-12-15 Thread git
commit 74c77bfd9932535d4b7a0a7d7cc7447164ead0d5
Author: Laslo Hunhold 
AuthorDate: Wed Dec 15 12:53:48 2021 +0100
Commit: Laslo Hunhold 
CommitDate: Wed Dec 15 12:53:48 2021 +0100

Refactor Makefile, add dist-target and add test-util

All targets were checked and amended, if necessary. A new dist-target
was added to quickly create a tarball.

For the test-programs, given code-duplication, util.h and util.c
were added.

Signed-off-by: Laslo Hunhold 

diff --git a/Makefile b/Makefile
index d166e34..fb8969a 100644
--- a/Makefile
+++ b/Makefile
@@ -8,11 +8,21 @@ DATA =\
data/emoji-data.txt\
data/GraphemeBreakProperty.txt\
data/GraphemeBreakTest.txt
-GEN = gen/grapheme gen/grapheme-test
-LIB = src/grapheme src/utf8 src/util
-TEST = test/grapheme test/grapheme-performance test/utf8-decode test/utf8-encode
-
-MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
+GEN =\
+   gen/grapheme\
+   gen/grapheme-test
+SRC =\
+   src/grapheme\
+   src/utf8\
+   src/util
+TEST =\
+   test/grapheme\
+   test/grapheme-performance\
+   test/utf8-decode\
+   test/utf8-encode
+MAN3 =\
+   man/lg_grapheme_isbreak.3\
+   man/lg_grapheme_nextbreak.3
 MAN7 = man/libgrapheme.7
 
 all: libgrapheme.a libgrapheme.so
@@ -20,20 +30,21 @@ all: libgrapheme.a libgrapheme.so
 gen/grapheme.o: gen/grapheme.c config.mk gen/util.h
 gen/grapheme-test.o: gen/grapheme-test.c config.mk gen/util.h
 gen/util.o: gen/util.c config.mk gen/util.h
-src/utf8.o: src/utf8.c config.mk grapheme.h
 src/grapheme.o: src/grapheme.c config.mk gen/grapheme.h grapheme.h src/util.h
-src/util.o: src/util.c config.mk src/util.h
-test/grapheme.o: test/grapheme.c config.mk gen/grapheme-test.h grapheme.h
-test/grapheme-performance.o: test/grapheme-performance.c config.mk gen/grapheme-test.h grapheme.h
-test/utf8-encode.o: test/utf8-encode.c config.mk grapheme.h
-test/utf8-decode.o: test/utf8-decode.c config.mk grapheme.h
+src/utf8.o: src/utf8.c config.mk grapheme.h
+src/util.o: src/util.c config.mk grapheme.h src/util.h
+test/grapheme.o: test/grapheme.c config.mk gen/grapheme-test.h grapheme.h test/util.h
+test/grapheme-performance.o: test/grapheme-performance.c config.mk gen/grapheme-test.h grapheme.h test/util.h
+test/utf8-encode.o: test/utf8-encode.c config.mk grapheme.h test/util.h
+test/utf8-decode.o: test/utf8-decode.c config.mk grapheme.h test/util.h
+test/util.o: test/util.c config.mk test/util.h
 
 gen/grapheme: gen/grapheme.o gen/util.o
 gen/grapheme-test: gen/grapheme-test.o gen/util.o
-test/grapheme: test/grapheme.o libgrapheme.a
-test/grapheme-performance: test/grapheme-performance.o libgrapheme.a
-test/utf8-encode: test/utf8-encode.o libgrapheme.a
-test/utf8-decode: test/utf8-decode.o libgrapheme.a
+test/grapheme: test/grapheme.o test/util.o libgrapheme.a
+test/grapheme-performance: test/grapheme-performance.o test/util.o libgrapheme.a
+test/utf8-encode: test/utf8-encode.o test/util.o libgrapheme.a
+test/utf8-decode: test/utf8-decode.o test/util.o libgrapheme.a
 
 gen/grapheme.h: data/emoji-data.txt data/GraphemeBreakProperty.txt gen/grapheme
 gen/grapheme-test.h: data/GraphemeBreakTest.txt gen/grapheme-test
@@ -54,16 +65,16 @@ $(GEN:=.h):
$(@:.h=) > $@
 
 $(TEST):
-   $(CC) -o $@ $(LDFLAGS) $@.o libgrapheme.a
+   $(CC) -o $@ $(LDFLAGS) $@.o test/util.o libgrapheme.a
 
 .c.o:
$(CC) -c -o $@ $(CPPFLAGS) $(CFLAGS) $<
 
-libgrapheme.a: $(LIB:=.o)
+libgrapheme.a: $(SRC:=.o)
$(AR) rc $@ $?
$(RANLIB) $@
 
-libgrapheme.so: $(LIB:=.o)
+libgrapheme.so: $(SRC:=.o)
$(CC) -o $@ -shared $?
 
 test: $(TEST)
@@ -88,7 +99,23 @@ uninstall:
rm -f "$(DESTDIR)$(INCPREFIX)/grapheme.h"
 
 clean:
-   rm -f $(GEN:=.h) $(GEN:=.o) $(GEN) gen/util.o $(LIB:=.o) $(TEST:=.o) $(TEST) libgrapheme.a libgrapheme.so
+   rm -f $(GEN:=.h) $(GEN:=.o) $(GEN) gen/util.o $(SRC:=.o) src/util.o \
+   $(TEST:=.o) test/util.o $(TEST) libgrapheme.a libgrapheme.so
 
 clean-data:
rm -f $(DATA)
+
+dist:
+   mkdir libgrapheme-$(VERSION) libgrapheme-$(VERSION)/data\
+   libgrapheme-$(VERSION)/gen libgrapheme-$(VERSION)/man\
+   libgrapheme-$(VERSION)/src libgrapheme-$(VERSION)/test
+   cp config.mk grapheme.h LICENSE Makefile libgrapheme-$(VERSION)
+   cp $(DATA) libgrapheme-$(VERSION)/data
+   cp $(GEN:=.c) gen/util.c gen/util.h libgrapheme-$(VERSION)/gen
+   cp $(MAN3) $(MAN7) libgrapheme-$(VERSION)/man
+   cp $(SRC:=.c) src/util.h libgrapheme-$(VERSION)/src
+   cp $(TEST:=.c) test/util.c test/util.h libgrapheme-$(VERSION)/test
+   tar -cf libgrapheme-$(VERSION).tar libgrapheme-$(VERSION)
+   rm -rf libgrapheme-$(VERSION)
+
+.PHONY: all test install uninstall clean dist
diff --git a/test/grapheme-performance.c b/test/grapheme-performance.c
index 05035bd..4bfd429 100644
--- a/test/grapheme-performance.c
+++ b/test/grapheme-performance.c
@@ 

[hackers] [libgrapheme] Refactor manual pages, document lg_grapheme_isbreak() || Laslo Hunhold

2021-12-15 Thread git
commit 497b500df21812b49729ff9514dd81dac29ec940
Author: Laslo Hunhold 
AuthorDate: Wed Dec 15 10:59:42 2021 +0100
Commit: Laslo Hunhold 
CommitDate: Wed Dec 15 10:59:42 2021 +0100

Refactor manual pages, document lg_grapheme_isbreak()

In particular, simplify the given example in lg_grapheme_nextbreak().

Signed-off-by: Laslo Hunhold 

diff --git a/Makefile b/Makefile
index d626c8f..d166e34 100644
--- a/Makefile
+++ b/Makefile
@@ -12,7 +12,7 @@ GEN = gen/grapheme gen/grapheme-test
 LIB = src/grapheme src/utf8 src/util
 TEST = test/grapheme test/grapheme-performance test/utf8-decode test/utf8-encode
 
-MAN3 = man/grapheme_bytelen.3
+MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
 MAN7 = man/libgrapheme.7
 
 all: libgrapheme.a libgrapheme.so
diff --git a/man/grapheme_bytelen.3 b/man/grapheme_bytelen.3
deleted file mode 100644
index 0e26570..0000000
--- a/man/grapheme_bytelen.3
+++ /dev/null
@@ -1,85 +0,0 @@
-.Dd 2020-10-12
-.Dt GRAPHEME_BYTELEN 3
-.Os suckless.org
-.Sh NAME
-.Nm grapheme_bytelen
-.Nd compute grapheme cluster byte-length
-.Sh SYNOPSIS
-.In grapheme.h
-.Ft size_t
-.Fn grapheme_bytelen "const char *str"
-.Sh DESCRIPTION
-The
-.Fn grapheme_bytelen
-function computes the length (in bytes) of the grapheme cluster
-(see
-.Xr libgrapheme 7 )
-beginning at the UTF-8-encoded NUL-terminated string
-.Va str .
-.Sh RETURN VALUES
-The
-.Fn grapheme_bytelen
-function returns the length (in bytes) of the grapheme cluster beginning
-at
-.Va str
-or 0 if
-.Va str
-is
-.Dv NULL .
-.Sh EXAMPLES
-.Bd -literal
-/* cc (-static) -o example example.c -lgrapheme */
-#include <grapheme.h>
-#include <stdio.h>
-
-int
-main(void)
-{
-   /* UTF-8 encoded input */
-   char *s =
-   "T"
-   "\\xC3\\xAB" /* U+000EB LATIN SMALL LETTER E
- WITH DIAERESIS */
-   "s"
-   "t"
-   " "
-   "\\xF0\\x9F\\x91\\xA8" /* U+1F468 MAN */
-   "\\xE2\\x80\\x8D" /* U+0200D ZERO WIDTH JOINER */
-   "\\xF0\\x9F\\x91\\xA9" /* U+1F469 WOMAN */
-   "\\xE2\\x80\\x8D" /* U+0200D ZERO WIDTH JOINER */
-   "\\xF0\\x9F\\x91\\xA6" /* U+1F466 BOY */
-   " "
-   "\\xF0\\x9F\\x87\\xBA" /* U+1F1FA REGIONAL INDICATOR
- SYMBOL LETTER U */
-   "\\xF0\\x9F\\x87\\xB8" /* U+1F1F8 REGIONAL INDICATOR
- SYMBOL LETTER S */
-   " "
-   "\\xE0\\xA4\\xA8" /* U+00928 DEVANAGARI LETTER NA */
-   "\\xE0\\xA5\\x80" /* U+00940 DEVANAGARI VOWEL
- SIGN II */
-   " "
-   "\\xE0\\xAE\\xA8" /* U+00BA8 TAMIL LETTER NA */
-   "\\xE0\\xAE\\xBF" /* U+00BBF TAMIL VOWEL SIGN I */
-   "!";
-   size_t len;
-
-   /* print input string */
-   printf("Input: %s\\n", s);
-
-   /* print each grapheme cluster with accompanying byte-length */
-   while (*s != '\\0') {
-   len = grapheme_bytelen(s);
-   printf("%2zu byte(s) | %.*s\\n", len, (int)len, s, len);
-   s += len;
-   }
-
-   return 0;
-}
-.Ed
-.Sh SEE ALSO
-.Xr libgrapheme 7
-.Sh STANDARDS
-.Fn grapheme_bytelen
-is compliant with the Unicode 13.0.0 specification.
-.Sh AUTHORS
-.An Laslo Hunhold Aq Mt d...@frign.de
diff --git a/man/lg_grapheme_isbreak.3 b/man/lg_grapheme_isbreak.3
new file mode 100644
index 0000000..2570b2f
--- /dev/null
+++ b/man/lg_grapheme_isbreak.3
@@ -0,0 +1,79 @@
+.Dd 2021-12-15
+.Dt LG_GRAPHEME_ISBREAK 3
+.Os suckless.org
+.Sh NAME
+.Nm lg_grapheme_isbreak
+.Nd test for a grapheme cluster break between two code points
+.Sh SYNOPSIS
+.In grapheme.h
+.Ft size_t
+.Fn lg_grapheme_isbreak "uint_least32_t a, uint_least32_t b, LG_SEGMENTATION_STATE *state"
+.Sh DESCRIPTION
+The
+.Fn lg_grapheme_isbreak
+function determines if there is a grapheme cluster break (see
+.Xr libgrapheme 7 )
+between the two code points
+.Va a
+and
+.Va b .
+By specification this decision depends on a
+.Va state
+that can at most be completely reset after detecting a break and must
+be reset every time one deviates from sequential processing.
+.Pp
+If
+.Va state
+is
+.Dv NULL
+.Fn lg_grapheme_isbreak
+behaves as if it was called with a fully reset state.
+.Sh RETURN VALUES
+.Fn lg_grapheme_isbreak
+returns
+.Va true
+if there is a grapheme cluster break between the code points
+.Va a
+and
+.Va b
+and
+.Va false
+if there is not.
+.Sh EXAMPLES
+.Bd -literal
+/* cc (-static) -o example example.c -lgrapheme */
+#include 
+#include 
+#include 
+#include 
+
+int
+main(void)
+{
+   LG_SEGMENTATION_STATE state = { 0 };
+   uint_least32_t s1[] = ..., s2[] = ...; /* two input arrays */
+   size_t i;
+
+   for (i = 0; i + 1 < sizeof(s1) / sizeof(*s1); i++) {
+   if (lg_grapheme_isbreak(s[i],