Re: About regex, charset, and localization.

Steffen Nurpmeso via austin-group-l at The Open Group Fri, 20 Jun 2025 07:42:48 -0700

[uh, now even i top-post!!]

k...@keldix.com via austin-group-l at The Open Group wrote in
 <20250620112339.gb29...@www5.open-std.org>:
 |the wg20 issues are now transferred to sc35/wg5
 |they issued a standard iso/iec 30112 - built on posix, c and unicode


I myself had trouble parsing his problem per se.

If you, Niu Danny, create your own regex engine for your very own
work tasks, you can do whatever you want.
Why do you care whether this is compatible with POSIX?
I think the bytes 0x80-0xFF ... wasn't there an issue that talks
about the (effectively) high bit in the POSIX locale?
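
To make the byte-oriented idea concrete, here is a minimal sketch in
Python, not anything POSIX specifies.  It assumes the [:nonascii:]
name from the quoted message simply means "any byte 128-255", and it
reuses the path sanitization use case mentioned there.  The names
NONASCII, SAFE_COMPONENT, has_high_bytes and is_safe_relative_path are
hypothetical, and the set of allowed filename bytes is only an
assumption for illustration.

```python
import re

# Assumption: [:nonascii:] from the thread means "any byte 0x80-0xFF".
NONASCII = re.compile(rb"[\x80-\xff]")

# Assumption: a "safe" path component uses only these ASCII bytes.
SAFE_COMPONENT = re.compile(rb"[A-Za-z0-9._-]+")


def has_high_bytes(data: bytes) -> bool:
    """Return True if any byte has the high bit set."""
    return NONASCII.search(data) is not None


def is_safe_relative_path(path: bytes) -> bool:
    """Byte-based check: relative path, no '.' or '..', ASCII-only parts."""
    if not path or path.startswith(b"/"):
        return False
    for part in path.split(b"/"):
        # Empty parts (from "a//b") fail fullmatch because of the '+'.
        if part in (b".", b"..") or SAFE_COMPONENT.fullmatch(part) is None:
            return False
    return True
```

Because the patterns are bytes patterns, no locale or character
database is consulted at all; whether that is acceptable depends on
what the paths are allowed to contain.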

Other than that, if you instead want to create a regular
expression engine that *really* supports the native languages of
the world, you will not be able to do this with POSIX, nor with
ISO C23 (as far as i have looked).
You will need the ICU library or, as you say, Perl, which is known
to have delved deeply into that topic.  You need to inspect entire
strings, not individual wchar_t's, and you need to look around for
context in certain languages and for certain operations.
The latter is plain.
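
A small illustration of why single code units are not enough, as a
sketch in Python using only its standard re and unicodedata modules
(not ICU, and not a full grapheme segmentation).  The helper name
matches_single_char is hypothetical.

```python
import re
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# '.' matches exactly one code point, not one user-perceived character.
ONE_CHAR = re.compile(r".", re.DOTALL)


def matches_single_char(text: str) -> bool:
    """Return True if the whole string is one code point."""
    return ONE_CHAR.fullmatch(text) is not None


def matches_single_char_nfc(text: str) -> bool:
    """Same check after NFC normalization of the entire string."""
    return matches_single_char(unicodedata.normalize("NFC", text))
```

Here matches_single_char(composed) is true while
matches_single_char(decomposed) is false, although both render the
same.  Normalizing the whole string first makes this pair agree, but
only because a precomposed form exists; sequences without one still
need the context-aware handling described above.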

 |keld
 |
 |On Fri, Jun 20, 2025 at 08:54:47AM +0000, Niu Danny via austin-group-l \
 |at The Open Group wrote:
 |> Does any of us still have a recollection of our interaction
 |> with WG20 - Internationalization?
 |>
 |> I intend to learn more about the background
 |> before making any kind of judgment.
 |> 
 |>> On 25 May 2025, at 07:58, Niu Danny via austin-group-l at The Open \
 |>> Group <austin-group-l@opengroup.org> wrote:
 |>> 
 |>> I'd like to ask whether the following premise makes the following
 |>> implementation strategy valid. Not sure if it's on topic though.
 |>> 
 |>> Localization in Unix was intended to sell the system to
 |>> non-English-speaking customers, but nowadays its relevance is
 |>> decreasing due to the development of language models of deployable
 |>> scale and improved translation algorithms - although their accuracy
 |>> is debated, they're sufficient considering they're primarily just a
 |>> first-hand built-in source, and users would purchase more
 |>> professional translation software or services for work.
 |>> 
 |>> Internationalized regex is supposedly a subsidiary tool to
 |>> localization for text processing, but for a regex engine to be really
 |>> internationalized, I think a character database model is needed,
 |>> which is easy, as the true boundary of a character is not always
 |>> clear in every culture. I suppose the readers will expect Perl to be
 |>> mentioned, so yes, a large codebase of text processing tools is
 |>> written in Perl, owing to its more versatile regex and programming
 |>> language syntax, as well as its diverse ecosystem.
 |>> 
 |>> Regex in Unix really is mostly good for system administration -
 |>> especially for tasks that are meant to be automated, such as log
 |>> analysis and incident reports. Configuration editing and other tasks
 |>> that require human decisions, although they cannot be automated, can
 |>> be greatly augmented when a useful tool such as regex is available
 |>> to the user.
 |>> 
 |>> I personally found another use of regex where localization prevented
 |>> me from doing what I needed. In web back-end programming, there's the
 |>> need for "path sanitization" when storing and retrieving files, to
 |>> prevent a malicious client from using a crafted path to overwrite or
 |>> access restricted data. Because the regex engine I used at the time
 |>> was bundled with internationalization support, I had to install an
 |>> additional dependency during deployment, which wasn't discovered
 |>> during development. Minor anecdote though.
 |>> 
 |>> POSIX already permits an implementation to support no locales other
 |>> than the C/POSIX locale, so a regex implementation that hasn't any
 |>> extension mechanism whatsoever, on a system implementation that
 |>> doesn't support defining additional locales, is conforming. But
 |>> here's the part that I'm not sure about:
 |>> 
 |>> I want to implement an ASCII-based regex that's simultaneously a
 |>> byte-based regex. POSIX doesn't require me to use the exact ASCII
 |>> character set, so in theory I have the freedom to call the byte
 |>> values 128-255 [:nonchar:] or [:nonascii:] if I see fit. But in this
 |>> case, I strictly shouldn't advertise the charset as ASCII in my
 |>> environment, yet programs that see ASCII can assume some properties
 |>> about the environment, but such assumptions will in turn make them
 |>> strictly non-portable?
 |>> 
 |>> How do you view these issues? Thanks for your opinion.
 --End of <20250620112339.gb29...@www5.open-std.org>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
