On Thu, 13 Feb 2020, Patrice Guérin wrote:

> I'm facing some problems with the locale character table definitions. 

Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.

>    I really don't know who is right or not, and though the following
>    characters are not widely used (except euro symbol), I think this
>    can lead to inconsistencies.

I'm sure it can, but I suspect there isn't anything that can be done 
about it.

>      * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
>        90, 9D as Ctrl, Linux does not.
>      * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
>        Windows does not.
>      * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
>        Print, Windows does not.

I do not understand what you mean by "0x88 (U+02c6)" because locales 
handle only 256 characters.

>      * Linux defines char 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
>        Print and Punct, Windows does not.
>      * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
>        Windows defines it as Space, Blank and Print.
>      * Windows defines chars 0xAA (ª), 0xB5(µ) and 0xBA (º) as Punct,
>        Linux does not.
>      * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
>      * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
>        and Digit, Linux does not.
> 
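For anyone wanting to reproduce a comparison like the one in the list
above, here is a minimal sketch in C that prints how the standard ctype
functions classify bytes 0x80 to 0xBF in a given locale. The function
names describe_byte() and dump_high_bytes() are made up for this
example, and the locale name to pass is an assumption: it differs
between systems and must be installed.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Write into buf the names of the ctype classes that byte c belongs to
   in the current LC_CTYPE locale. buf must hold at least 64 bytes. */
void describe_byte(unsigned char c, char *buf)
{
  buf[0] = '\0';
  if (isalpha(c)) strcat(buf, "Alpha ");
  if (isdigit(c)) strcat(buf, "Digit ");
  if (isalnum(c)) strcat(buf, "Alnum ");
  if (ispunct(c)) strcat(buf, "Punct ");
  if (isgraph(c)) strcat(buf, "Graph ");
  if (isprint(c)) strcat(buf, "Print ");
  if (isspace(c)) strcat(buf, "Space ");
  if (isblank(c)) strcat(buf, "Blank ");
  if (iscntrl(c)) strcat(buf, "Ctrl ");
}

/* Print the classification of bytes 0x80..0xBF for the named locale.
   Note that setlocale() changes the locale for the whole process.
   Returns 0 on success, -1 if the locale is not available. */
int dump_high_bytes(const char *locale_name)
{
  char buf[64];
  if (setlocale(LC_CTYPE, locale_name) == NULL) return -1;
  for (int c = 0x80; c <= 0xBF; c++)
    {
    describe_byte((unsigned char)c, buf);
    printf("0x%02X: %s\n", c, buf);
    }
  return 0;
}
```

Calling dump_high_bytes() with the same code page's locale name on each
system and comparing the two outputs should show the kind of
differences listed above.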
> Now, I have some questions:
> 
> 1. If I correctly understood the process of the chartables build at
>    runtime,
>     1. the ctype functions are used only in pcre2_maketables() so the
>        locale can be set just before this call at thread level.

Yes.

>     2. the char table returned should be freed after the calls to
>        pcre2_match()

Yes.

>     3. A compilation context is to be created to associate the char table.

Yes.

>     4. Can it be freed just after the call to pcre2_compile() ?

The compilation context can be freed, but the tables themselves must be 
retained until all matches are done.

> 2. What do you think about the availability of a function to load the
>    char tables as a binary file ?
>    This could be useful to get exactly the same tables in different OS.

I suspect that this is a very specialist requirement, and I would rather 
encourage people to switch to Unicode. Also, moving binary data between 
operating systems is not that simple because of endianness issues, and 
indeed 8/16/32-bit issues.

Philip

-- 
Philip Hazel
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
