Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold
On 2021-12-15, Laslo Hunhold wrote:
> thanks for clearing that up! After more thought I made the decision to
> go with uint8_t, though. I see the point regarding character types, but
> this notion is more of a smelly foot in the C standard. We are moving
> towards UTF-8 as _the_ default encoding format, so considering
> character strings as such is justified.

I think this is a mistake. It makes it very difficult to use the API
correctly if you have data in an array of char or unsigned char, which
is usually the case.

Here's an example of some real code that has a char * buffer:
https://git.sr.ht/~exec64/imv/tree/a83304d4d673aae6efed51da1986bd7315a4d642/item/src/console.c#L54-58

How would you suggest that this code be written for the new API? The
only thing I can think is

	if (buffer[position] != 0) {
		size_t bufferlen = strlen(buffer) + 1 - position;
		uint8_t *newbuffer = malloc(bufferlen);
		if (!newbuffer) ...
		memcpy(newbuffer, buffer + position, bufferlen);
		position += grapheme_bytelen(newbuffer);
		free(newbuffer);
	}
	return position;

This sort of thing would turn me off of using the library entirely.

> Any other way would have introduced too many implicit assumptions.

Like what? If you really want your code to break when CHAR_BIT != 8,
you could use a static assert (there are also ways to emulate this in
C99). But even if CHAR_BIT > 8, unsigned char is perfectly capable to
represent all the values used in UTF-8 encoding, so I don't see the
problem.

> And even if all fails and there simply is no 8-bit-type, one can always
> use the lg_grapheme_isbreak()-function and roll his own de/encoding.

I'm still confused as to what you mean by rolling your own de/encoding.
What would that look like? If there is no 8-bit type, libgrapheme could
not be compiled or used at all since uint8_t would be missing.
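To make the CHAR_BIT remark above concrete, here is a minimal sketch (not taken from the thread or from libgrapheme) of one common way to emulate a static assertion in C99, together with a helper that inspects UTF-8 bytes held in a plain char buffer through unsigned char. The names assert_char_bit_is_8 and utf8_seqlen are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * C99 emulation of a static assertion: the array size becomes -1,
 * which is a compile-time error, if CHAR_BIT is not 8.
 * (Hypothetical name, for illustration only.)
 */
typedef char assert_char_bit_is_8[(CHAR_BIT == 8) ? 1 : -1];

/*
 * Hypothetical helper, not part of libgrapheme: return the length in
 * bytes of the UTF-8 sequence announced by the lead byte at s[0], or 0
 * if s[0] cannot start a valid sequence. The byte is read through an
 * unsigned char lvalue, which is always permitted for any object.
 */
size_t
utf8_seqlen(const char *s)
{
	unsigned char c = *(const unsigned char *)s;

	if (c < 0x80)
		return 1; /* ASCII */
	if (c >= 0xC2 && c <= 0xDF)
		return 2; /* 2-byte lead (0xC0, 0xC1 are overlong) */
	if (c >= 0xE0 && c <= 0xEF)
		return 3; /* 3-byte lead */
	if (c >= 0xF0 && c <= 0xF4)
		return 4; /* 4-byte lead, up to U+10FFFF */
	return 0;         /* continuation byte or invalid lead */
}
```

The helper only classifies the lead byte and does not validate continuation bytes; the point is merely that a char * buffer can be examined without copying it into a uint8_t array.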
Re: [hackers] [libgrapheme] Refine types (uint8_t -> char, uint32_t -> uint_least32_t) || Laslo Hunhold
On Sun, 12 Dec 2021 12:41:15 -0800
Michael Forney wrote:

Dear Michael,

> > But char and unsigned char are of integer type, aren't they?
>
> They are integer types and character types. Character types are a
> subset of integer types: char, signed char, and unsigned char.
>
> > So on a
> > POSIX-system, which is 99.999% of cases, it makes no difference if
> > we cast between (char *) and (unsigned char *) (as you suggested
> > above if we went with unsigned char * for the interfaces) and
> > between (char *) and (uint_least8_t *), does it? So if the end-user
> > has to cast anyway, then he can just cast to an uint* type as well.
> >
>
> The difference is that uint8_t and uint_least8_t are not necessarily
> character types. Although the existence of uint8_t implies that
> unsigned char has exactly 8 bits, uint8_t could be a separate 8-bit
> integer type distinct from the character types. If this were the case,
> accessing an array of unsigned char through a pointer to uint8_t would
> be undefined behavior (C99 6.5p7).
>
> Here are some examples:
>
> 	char a[1] = {0};
> 	// always valid, evaluates to 0
> 	*(unsigned char *)a;
> 	// always valid, sets the bits of a[0] to
> 	// but the value of a[0] depends on the signed-int representation
> 	*(unsigned char *)a = 0xff;
> 	// undefined behavior if uint8_t is not a character type
> 	*(uint8_t *)a;
> 	*(uint8_t *)a = 0xff;
>
> 	uint8_t b[1] = {0};
> 	// always valid, evaluates to 0
> 	*(unsigned char *)b;
> 	// always valid, sets the bits of a[0] to
> 	*(unsigned char *)b = 0xff;

thanks for clearing that up! After more thought I made the decision to
go with uint8_t, though. I see the point regarding character types, but
this notion is more of a smelly foot in the C standard. We are moving
towards UTF-8 as _the_ default encoding format, so considering
character strings as such is justified. Any other way would have
introduced too many implicit assumptions.
> > Even more drastically, given UTF-8 is an encoding, I don't really
> > feel good about not being strict about the returned arrays in such
> > a way that it becomes possible to have an array of e.g. 16-bit
> > integers where only the bottom half is used and it become the
> > user's job to then hand-craft it into a proper array to send over
> > the network, etc. Surely one can hack around this as a library
> > user, but at a certain point I think "to hell with it" and just be
> > strict about it in the API. C already has a weak type system and I
> > don't want to further weaken it by supporting decades-old implicit
> > assumptions on types. So in a way, maybe uint8_t is the way to go,
> > and then the library user immediately knows it's not going to work
> > with his machine because uint8_t is not defined for him.
>
> Not quite sure what you mean here. Are you talking about the case
> where CHAR_BIT is 16? In that case, there'd be no uint8_t, so you
> couldn't "hand-craft it into a proper array". I'm not sure how
> networking APIs would work on such a system, but maybe they'd consider
> only the lowest 8 bits of each byte.

Yes exactly. Trying to import grapheme.h would immediately show that
the system is incompatible rather than silently "breaking" on this
behalf. Given how smart compilers have become working with "halves" of
registers, I'd much rather expect the CPU to offer instructions to work
with 8-bit-integers as "halves" of 16 bits (accessing lower and upper).
And even if all fails and there simply is no 8-bit-type, one can always
use the lg_grapheme_isbreak()-function and roll his own de/encoding.

With best regards

Laslo
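The aliasing point quoted in this message (C99 6.5p7) can be sketched as follows. This is only an illustration under the assumption that the caller holds its data in a char array; count_non_ascii is a hypothetical name and not a libgrapheme function. It reads the buffer through unsigned char, which is the access the quoted examples mark as always valid.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: count the bytes in a
 * char buffer that have the high bit set, i.e. bytes that belong to a
 * multi-byte UTF-8 sequence. Every object may be accessed through an
 * lvalue of character type, so this cast is well-defined no matter
 * whether plain char is signed or unsigned.
 */
size_t
count_non_ascii(const char *buf, size_t len)
{
	const unsigned char *p = (const unsigned char *)buf;
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		if (p[i] >= 0x80)
			n++;
	}

	return n;
}
```

Comparing buf[i] directly against 0x80 would not work on platforms where char is signed, since such bytes would read as negative values; going through unsigned char avoids that without relying on uint8_t being a character type.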
Re: [hackers] [libgrapheme] Refactor Makefile, add dist-target and add test-util || Laslo Hunhold
On Wed, 15 Dec 2021 13:28:04 +0100
Quentin Rameau wrote:

Dear Quentin,

> > -GEN = gen/grapheme gen/grapheme-test
> > -LIB = src/grapheme src/utf8 src/util
> > -TEST = test/grapheme test/grapheme-performance test/utf8-decode
> > test/utf8-encode -
> > -MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
> > +GEN =\
> > +	gen/grapheme\
> > +	gen/grapheme-test
> > +SRC =\
> > +	src/grapheme\
> > +	src/utf8\
> > +	src/util
> > +TEST =\
> > +	test/grapheme\
> > +	test/grapheme-performance\
> > +	test/utf8-decode\
> > +	test/utf8-encode
> > +MAN3 =\
> > +	man/lg_grapheme_isbreak.3\
> > +	man/lg_grapheme_nextbreak.3
> > MAN7 = man/libgrapheme.7
> >
> > all: libgrapheme.a libgrapheme.so
>
> The idiomatic way of using those is to escape the newline on every
> macro line.
> The goal here is to help producing less noise in patches which add or
> remove lines there, so that only the actual concerned lines are
> modified, not the one that may be the last because you now need to add
> or remove a '\' there.

thanks for this! I now pushed a commit that adopts this good idiom.

With best regards

Laslo
[hackers] [libgrapheme] Make lists in Makefile more idiomatic and avoid breaks || Laslo Hunhold
commit 20c105bcdd1c54401d4d23cdb9ded56ee7a2ffd4
Author:     Laslo Hunhold
AuthorDate: Wed Dec 15 13:34:06 2021 +0100
Commit:     Laslo Hunhold
CommitDate: Wed Dec 15 13:34:06 2021 +0100

    Make lists in Makefile more idiomatic and avoid breaks

    Thanks Quentin Rameau for the remark regarding the more idiomatic way
    to specify lists where also the last element has an explicit
    line-break followed by an empty line.

    Also avoid breaks later on in the code: This breaks the line-length a
    bit, but has an effect on the output. I prefer a cleaner output (also
    in the build-logs) over one overlong line.

    Signed-off-by: Laslo Hunhold

diff --git a/Makefile b/Makefile
index fb8969a..104d805 100644
--- a/Makefile
+++ b/Makefile
@@ -7,22 +7,27 @@ include config.mk
 DATA =\
 	data/emoji-data.txt\
 	data/GraphemeBreakProperty.txt\
-	data/GraphemeBreakTest.txt
+	data/GraphemeBreakTest.txt\
+
 GEN =\
 	gen/grapheme\
-	gen/grapheme-test
+	gen/grapheme-test\
+
 SRC =\
 	src/grapheme\
 	src/utf8\
-	src/util
+	src/util\
+
 TEST =\
 	test/grapheme\
 	test/grapheme-performance\
 	test/utf8-decode\
-	test/utf8-encode
+	test/utf8-encode\
+
 MAN3 =\
 	man/lg_grapheme_isbreak.3\
-	man/lg_grapheme_nextbreak.3
+	man/lg_grapheme_nextbreak.3\
+
 MAN7 = man/libgrapheme.7
 
 all: libgrapheme.a libgrapheme.so
@@ -99,16 +104,14 @@ uninstall:
 	rm -f "$(DESTDIR)$(INCPREFIX)/grapheme.h"
 
 clean:
-	rm -f $(GEN:=.h) $(GEN:=.o) $(GEN) gen/util.o $(SRC:=.o) src/util.o \
-	      $(TEST:=.o) test/util.o $(TEST) libgrapheme.a libgrapheme.so
+	rm -f $(GEN:=.h) $(GEN:=.o) gen/util.o $(GEN) $(SRC:=.o) src/util.o $(TEST:=.o) test/util.o $(TEST) libgrapheme.a libgrapheme.so
 
 clean-data:
 	rm -f $(DATA)
 
 dist:
-	mkdir libgrapheme-$(VERSION) libgrapheme-$(VERSION)/data\
-	      libgrapheme-$(VERSION)/gen libgrapheme-$(VERSION)/man\
-	      libgrapheme-$(VERSION)/src libgrapheme-$(VERSION)/test
+	mkdir libgrapheme-$(VERSION)
+	for m in data gen man src test; do mkdir libgrapheme-$(VERSION)/$$m; done
 	cp config.mk grapheme.h LICENSE Makefile libgrapheme-$(VERSION)
 	cp $(DATA) libgrapheme-$(VERSION)/data
 	cp $(GEN:=.c) gen/util.c gen/util.h libgrapheme-$(VERSION)/gen
Re: [hackers] [libgrapheme] Refactor Makefile, add dist-target and add test-util || Laslo Hunhold
Hi Laslo,

As a note,

> -GEN = gen/grapheme gen/grapheme-test
> -LIB = src/grapheme src/utf8 src/util
> -TEST = test/grapheme test/grapheme-performance test/utf8-decode
> test/utf8-encode
> -
> -MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
> +GEN =\
> +	gen/grapheme\
> +	gen/grapheme-test
> +SRC =\
> +	src/grapheme\
> +	src/utf8\
> +	src/util
> +TEST =\
> +	test/grapheme\
> +	test/grapheme-performance\
> +	test/utf8-decode\
> +	test/utf8-encode
> +MAN3 =\
> +	man/lg_grapheme_isbreak.3\
> +	man/lg_grapheme_nextbreak.3
> MAN7 = man/libgrapheme.7
>
> all: libgrapheme.a libgrapheme.so

The idiomatic way of using those is to escape the newline on every
macro line.
The goal here is to help producing less noise in patches which add or
remove lines there, so that only the actual concerned lines are
modified, not the one that may be the last because you now need to add
or remove a '\' there.
[hackers] [libgrapheme] Refactor Makefile, add dist-target and add test-util || Laslo Hunhold
commit 74c77bfd9932535d4b7a0a7d7cc7447164ead0d5
Author:     Laslo Hunhold
AuthorDate: Wed Dec 15 12:53:48 2021 +0100
Commit:     Laslo Hunhold
CommitDate: Wed Dec 15 12:53:48 2021 +0100

    Refactor Makefile, add dist-target and add test-util

    All targets were checked and amended, if necessary. A new dist-target
    was added to quickly create a tarball.

    For the test-programs, given code-duplication, util.h and util.c were
    added.

    Signed-off-by: Laslo Hunhold

diff --git a/Makefile b/Makefile
index d166e34..fb8969a 100644
--- a/Makefile
+++ b/Makefile
@@ -8,11 +8,21 @@ DATA =\
 	data/emoji-data.txt\
 	data/GraphemeBreakProperty.txt\
 	data/GraphemeBreakTest.txt
-GEN = gen/grapheme gen/grapheme-test
-LIB = src/grapheme src/utf8 src/util
-TEST = test/grapheme test/grapheme-performance test/utf8-decode test/utf8-encode
-
-MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
+GEN =\
+	gen/grapheme\
+	gen/grapheme-test
+SRC =\
+	src/grapheme\
+	src/utf8\
+	src/util
+TEST =\
+	test/grapheme\
+	test/grapheme-performance\
+	test/utf8-decode\
+	test/utf8-encode
+MAN3 =\
+	man/lg_grapheme_isbreak.3\
+	man/lg_grapheme_nextbreak.3
 MAN7 = man/libgrapheme.7
 
 all: libgrapheme.a libgrapheme.so
@@ -20,20 +30,21 @@ all: libgrapheme.a libgrapheme.so
 gen/grapheme.o: gen/grapheme.c config.mk gen/util.h
 gen/grapheme-test.o: gen/grapheme-test.c config.mk gen/util.h
 gen/util.o: gen/util.c config.mk gen/util.h
-src/utf8.o: src/utf8.c config.mk grapheme.h
 src/grapheme.o: src/grapheme.c config.mk gen/grapheme.h grapheme.h src/util.h
-src/util.o: src/util.c config.mk src/util.h
-test/grapheme.o: test/grapheme.c config.mk gen/grapheme-test.h grapheme.h
-test/grapheme-performance.o: test/grapheme-performance.c config.mk gen/grapheme-test.h grapheme.h
-test/utf8-encode.o: test/utf8-encode.c config.mk grapheme.h
-test/utf8-decode.o: test/utf8-decode.c config.mk grapheme.h
+src/utf8.o: src/utf8.c config.mk grapheme.h
+src/util.o: src/util.c config.mk grapheme.h src/util.h
+test/grapheme.o: test/grapheme.c config.mk gen/grapheme-test.h grapheme.h test/util.h
+test/grapheme-performance.o: test/grapheme-performance.c config.mk gen/grapheme-test.h grapheme.h test/util.h
+test/utf8-encode.o: test/utf8-encode.c config.mk grapheme.h test/util.h
+test/utf8-decode.o: test/utf8-decode.c config.mk grapheme.h test/util.h
+test/util.o: test/util.c config.mk test/util.h
 
 gen/grapheme: gen/grapheme.o gen/util.o
 gen/grapheme-test: gen/grapheme-test.o gen/util.o
-test/grapheme: test/grapheme.o libgrapheme.a
-test/grapheme-performance: test/grapheme-performance.o libgrapheme.a
-test/utf8-encode: test/utf8-encode.o libgrapheme.a
-test/utf8-decode: test/utf8-decode.o libgrapheme.a
+test/grapheme: test/grapheme.o test/util.o libgrapheme.a
+test/grapheme-performance: test/grapheme-performance.o test/util.o libgrapheme.a
+test/utf8-encode: test/utf8-encode.o test/util.o libgrapheme.a
+test/utf8-decode: test/utf8-decode.o test/util.o libgrapheme.a
 
 gen/grapheme.h: data/emoji-data.txt data/GraphemeBreakProperty.txt gen/grapheme
 gen/grapheme-test.h: data/GraphemeBreakTest.txt gen/grapheme-test
@@ -54,16 +65,16 @@ $(GEN:=.h):
 	$(@:.h=) > $@
 
 $(TEST):
-	$(CC) -o $@ $(LDFLAGS) $@.o libgrapheme.a
+	$(CC) -o $@ $(LDFLAGS) $@.o test/util.o libgrapheme.a
 
 .c.o:
 	$(CC) -c -o $@ $(CPPFLAGS) $(CFLAGS) $<
 
-libgrapheme.a: $(LIB:=.o)
+libgrapheme.a: $(SRC:=.o)
 	$(AR) rc $@ $?
 	$(RANLIB) $@
 
-libgrapheme.so: $(LIB:=.o)
+libgrapheme.so: $(SRC:=.o)
 	$(CC) -o $@ -shared $?
 
 test: $(TEST)
@@ -88,7 +99,23 @@ uninstall:
 	rm -f "$(DESTDIR)$(INCPREFIX)/grapheme.h"
 
 clean:
-	rm -f $(GEN:=.h) $(GEN:=.o) $(GEN) gen/util.o $(LIB:=.o) $(TEST:=.o) $(TEST) libgrapheme.a libgrapheme.so
+	rm -f $(GEN:=.h) $(GEN:=.o) $(GEN) gen/util.o $(SRC:=.o) src/util.o \
+	      $(TEST:=.o) test/util.o $(TEST) libgrapheme.a libgrapheme.so
 
 clean-data:
 	rm -f $(DATA)
+
+dist:
+	mkdir libgrapheme-$(VERSION) libgrapheme-$(VERSION)/data\
+	      libgrapheme-$(VERSION)/gen libgrapheme-$(VERSION)/man\
+	      libgrapheme-$(VERSION)/src libgrapheme-$(VERSION)/test
+	cp config.mk grapheme.h LICENSE Makefile libgrapheme-$(VERSION)
+	cp $(DATA) libgrapheme-$(VERSION)/data
+	cp $(GEN:=.c) gen/util.c gen/util.h libgrapheme-$(VERSION)/gen
+	cp $(MAN3) $(MAN7) libgrapheme-$(VERSION)/man
+	cp $(SRC:=.c) src/util.h libgrapheme-$(VERSION)/src
+	cp $(TEST:=.c) test/util.c test/util.h libgrapheme-$(VERSION)/test
+	tar -cf libgrapheme-$(VERSION).tar libgrapheme-$(VERSION)
+	rm -rf libgrapheme-$(VERSION)
+
+.PHONY: all test install uninstall clean dist
diff --git a/test/grapheme-performance.c b/test/grapheme-performance.c
index 05035bd..4bfd429 100644
--- a/test/grapheme-performance.c
+++ b/test/grapheme-performance.c
@@
[hackers] [libgrapheme] Refactor manual pages, document lg_grapheme_isbreak() || Laslo Hunhold
commit 497b500df21812b49729ff9514dd81dac29ec940
Author:     Laslo Hunhold
AuthorDate: Wed Dec 15 10:59:42 2021 +0100
Commit:     Laslo Hunhold
CommitDate: Wed Dec 15 10:59:42 2021 +0100

    Refactor manual pages, document lg_grapheme_isbreak()

    In particular, simplify the given example in lg_grapheme_nextbreak().

    Signed-off-by: Laslo Hunhold

diff --git a/Makefile b/Makefile
index d626c8f..d166e34 100644
--- a/Makefile
+++ b/Makefile
@@ -12,7 +12,7 @@ GEN = gen/grapheme gen/grapheme-test
 LIB = src/grapheme src/utf8 src/util
 TEST = test/grapheme test/grapheme-performance test/utf8-decode test/utf8-encode
 
-MAN3 = man/grapheme_bytelen.3
+MAN3 = man/lg_grapheme_isbreak.3 man/lg_grapheme_nextbreak.3
 MAN7 = man/libgrapheme.7
 
 all: libgrapheme.a libgrapheme.so
diff --git a/man/grapheme_bytelen.3 b/man/grapheme_bytelen.3
deleted file mode 100644
index 0e26570..000
--- a/man/grapheme_bytelen.3
+++ /dev/null
@@ -1,85 +0,0 @@
-.Dd 2020-10-12
-.Dt GRAPHEME_BYTELEN 3
-.Os suckless.org
-.Sh NAME
-.Nm grapheme_bytelen
-.Nd compute grapheme cluster byte-length
-.Sh SYNOPSIS
-.In grapheme.h
-.Ft size_t
-.Fn grapheme_bytelen "const char *str"
-.Sh DESCRIPTION
-The
-.Fn grapheme_bytelen
-function computes the length (in bytes) of the grapheme cluster
-(see
-.Xr libgrapheme 7 )
-beginning at the UTF-8-encoded NUL-terminated string
-.Va str .
-.Sh RETURN VALUES
-The
-.Fn grapheme_bytelen
-function returns the length (in bytes) of the grapheme cluster beginning
-at
-.Va str
-or 0 if
-.Va str
-is
-.Dv NULL .
-.Sh EXAMPLES
-.Bd -literal
-/* cc (-static) -o example example.c -lgrapheme */
-#include
-#include
-
-int
-main(void)
-{
-	/* UTF-8 encoded input */
-	char *s =
-		"T"
-		"\\xC3\\xAB" /* U+000EB LATIN SMALL LETTER E
-		                WITH DIAERESIS */
-		"s"
-		"t"
-		" "
-		"\\xF0\\x9F\\x91\\xA8" /* U+1F468 MAN */
-		"\\xE2\\x80\\x8D" /* U+0200D ZERO WIDTH JOINER */
-		"\\xF0\\x9F\\x91\\xA9" /* U+1F469 WOMAN */
-		"\\xE2\\x80\\x8D" /* U+0200D ZERO WIDTH JOINER */
-		"\\xF0\\x9F\\x91\\xA6" /* U+1F466 BOY */
-		" "
-		"\\xF0\\x9F\\x87\\xBA" /* U+1F1FA REGIONAL INDICATOR
-		                          SYMBOL LETTER U */
-		"\\xF0\\x9F\\x87\\xB8" /* U+1F1F8 REGIONAL INDICATOR
-		                          SYMBOL LETTER S */
-		" "
-		"\\xE0\\xA4\\xA8" /* U+00928 DEVANAGARI LETTER NA */
-		"\\xE0\\xA5\\x80" /* U+00940 DEVANAGARI VOWEL
-		                     SIGN II */
-		" "
-		"\\xE0\\xAE\\xA8" /* U+00BA8 TAMIL LETTER NA */
-		"\\xE0\\xAE\\xBF" /* U+00BBF TAMIL VOWEL SIGN I */
-		"!";
-	size_t len;
-
-	/* print input string */
-	printf("Input: %s\\n", s);
-
-	/* print each grapheme cluster with accompanying byte-length */
-	while (*s != '\\0') {
-		len = grapheme_bytelen(s);
-		printf("%2zu byte(s) | %.*s\\n", len, (int)len, s, len);
-		s += len;
-	}
-
-	return 0;
-}
-.Ed
-.Sh SEE ALSO
-.Xr libgrapheme 7
-.Sh STANDARDS
-.Fn grapheme_bytelen
-is compliant with the Unicode 13.0.0 specification.
-.Sh AUTHORS
-.An Laslo Hunhold Aq Mt d...@frign.de
diff --git a/man/lg_grapheme_isbreak.3 b/man/lg_grapheme_isbreak.3
new file mode 100644
index 000..2570b2f
--- /dev/null
+++ b/man/lg_grapheme_isbreak.3
@@ -0,0 +1,79 @@
+.Dd 2021-12-15
+.Dt LG_GRAPHEME_ISBREAK 3
+.Os suckless.org
+.Sh NAME
+.Nm lg_grapheme_isbreak
+.Nd test for a grapheme cluster break between two code points
+.Sh SYNOPSIS
+.In grapheme.h
+.Ft size_t
+.Fn lg_grapheme_isbreak "uint_least32_t a, uint_least32_t b, LG_SEGMENTATION_STATE *state"
+.Sh DESCRIPTION
+The
+.Fn lg_grapheme_isbreak
+function determines if there is a grapheme cluster break (see
+.Xr libgrapheme 7 )
+between the two code points
+.Va a
+and
+.Va b .
+By specification this decision depends on a
+.Va state
+that can at most be completely reset after detecting a break and must
+be reset every time one deviates from sequential processing.
+.Pp
+If
+.Va state
+is
+.Dv NULL
+.Fn lg_grapheme_isbreak
+behaves as if it was called with a fully reset state.
+.Sh RETURN VALUES
+.Fn lg_grapheme_isbreak
+returns
+.Va true
+if there is a grapheme cluster break between the code points
+.Va a
+and
+.Va b
+and
+.Va false
+if there is not.
+.Sh EXAMPLES
+.Bd -literal
+/* cc (-static) -o example example.c -lgrapheme */
+#include
+#include
+#include
+#include
+
+int
+main(void)
+{
+	LG_SEGMENTATION_STATE state = { 0 };
+	uint_least32_t s1[] = ..., s2[] = ...; /* two input arrays */
+	size_t i;
+
+	for (i = 0; i + 1 < sizeof(s1) / sizeof(*s1); i++) {
+		if (lg_grapheme_isbreak(s[i],