On Thu, 13 Feb 2020, Patrice Guérin wrote:

> I'm facing some problems with the locale character table definitions.

Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.

> I really don't know who is right or not, and though the following
> characters are not widely used (except euro symbol), I think this
> can lead to inconsistencies.

I'm sure it can, but I suspect there isn't anything that can be done
about it.

>   * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
>     90, 9D as Ctrl, Linux does not.
>   * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
>     Windows does not.
>   * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
>     Print, Windows does not.

I do not understand what you mean by "0x88 (U+02c6)" because locales
handle only 256 characters.

>   * Linux defines char 0x98 (U+02dc) and 0x99 (U+02122) as Graph,
>     Print and Punct, Windows does not.
>   * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
>     Windows defines it as Space, Blank and Print.
>   * Windows defines chars 0xAA (ª), 0xB5 (µ) and 0xBA (º) as Punct,
>     Linux does not.
>   * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
>   * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
>     and Digit, Linux does not.
>
> Now, I've some questions:
>
>  1. If I correctly understood the process of the chartables build at
>     runtime,
>      1. the ctype functions are used only in pcre2_maketables() so the
>         locale can be set just before this call at thread level.

Yes.

>      2. the char table returned should be freed after the calls to
>         pcre2_match()

Yes.

>      3. A compilation context is to be created to associate the char table.

Yes.

>      4. Can it be freed just after the call to pcre2_compile() ?

The compilation context can be freed, but the tables themselves must be
retained until all matches are done.

>  2. What do you think about the availability of a function to load the
>     char tables as a binary file ?
>     This could be useful to get exactly the same tables in different OS.

I suspect that this is a very specialist requirement, and I would rather
encourage people to switch to Unicode. Also, moving binary things
between OS is not that simple because of endian issues. And indeed
8/16/32 bit issues.

Philip

--
Philip Hazel

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev