Updating utf8's Unicode char sets

Wolfgang Corcoran-Mathe via Chicken-users Wed, 02 Jul 2025 10:00:11 -0700

Hi all,

The utf8 egg's unicode-char-sets component is woefully out of date:
according to the header comment, the current set definitions were
generated in July 2007. Since then, Unicode has been enriched with
new characters & with wonderful things like emoji. It's time the
sets were updated.

Therefore, I've made a new version of utf8 which generates all
character sets at build time, from the official UCD data files. I've
attached my low-dependency build script & related files to the
following ticket:

http://bugs.call-cc.org/ticket/1851
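
For readers who haven't looked at the UCD files: the data lines are
semicolon-separated fields with "#" comments, so turning them into
per-set ranges takes little machinery. The following is only a rough
sketch of that idea in Python; it is not the build script attached to
the ticket. The function name is hypothetical, and the sample lines
are written in the style of Scripts.txt rather than read from the
real file.

```python
# Illustrative only: parse UCD "Scripts.txt"-style lines into per-script
# code point ranges. All names here are hypothetical, not taken from the egg.
from collections import defaultdict

def parse_ucd_ranges(lines):
    """Return {property_value: [(lo, hi), ...]} with inclusive ranges."""
    sets = defaultdict(list)
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue                           # blank or comment-only line
        # First field: one code point or "lo..hi"; second: property value.
        codepoints, value = (f.strip() for f in data.split(";")[:2])
        lo, _, hi = codepoints.partition("..")
        sets[value].append((int(lo, 16), int(hi or lo, 16)))
    return dict(sets)

# Sample input in the style of Scripts.txt (abridged, for illustration).
sample = [
    "# Scripts-style sample",
    "0600..0604    ; Arabic # Cf   [5] ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT",
    "0606..0608    ; Arabic # Sm   [3] ARABIC-INDIC CUBE ROOT..ARABIC RAY",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR",
]
ranges = parse_ucd_ranges(sample)
```

Each key of the resulting table would correspond to one generated
set, e.g. the "Arabic" entry to the contents of a per-script module.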

Since the generated sets are sometimes very large, I've also split
the unicode-char-sets component into per-set modules, e.g.
(unicode-char-sets arabic). The (unicode-char-sets) module re-exports
all Unicode character sets, making it backwards-compatible with the
old, monolithic (unicode-char-sets).

A minor issue which I haven't yet solved is how to compile the
generated modules. Currently, the script invoked by custom-build
simply runs csc (without custom options) on each module file. This
ignores the compiler options that would usually be added by
chicken-install, but I'm not sure how to retrieve those options &
to invoke the compiler "correctly".

Let me know what you think. If these changes are appreciated, I'll
work on the case-mapping procedures next.

Regards,

Wolfgang

-- 
Wolfgang Corcoran-Mathe  <w...@sigwinch.xyz>
