About regex, charset, and localization.

Niu Danny via austin-group-l at The Open Group Sat, 24 May 2025 17:55:25 -0700

I'd like to ask whether the following premise makes the following
implementation strategy valid. I'm not sure if it's on topic, though.


Localization in Unix was intended to sell the system to non-English-speaking
customers, but nowadays its relevance is decreasing due to the development
of language models of deployable scale and improved translation algorithms.
Although their accuracy is debated, they're sufficient considering that they
serve primarily as a built-in first resort, and users would purchase more
professional translation software or services for work.

Internationalized regex is supposedly a subsidiary tool to localization for
text processing, but for a regex engine to be really internationalized, I think
a character database model is needed, which is not easy, as the true boundary
of a character is not always clear in every culture. I suppose readers will
expect Perl to be mentioned, so yes, a large body of text-processing tools
is written in Perl, owing to its more versatile regex and programming language
syntax, as well as its diverse ecosystem.

Regex in Unix really is mostly good for system administration, especially for
tasks that are meant to be automated, such as log analysis and incident reports.
Configuration editing and other tasks that require human decisions, although
they cannot be automated, can be greatly augmented when a useful tool such as
regex is available to the user.

I personally found another use of regex where localization prevented me from
doing what I needed. In web back-end programming, there's the need for
"path sanitization" when storing and retrieving files, to prevent a malicious
client from using a crafted path to overwrite or access restricted data.
Because the regex engine I used at the time was bundled with
internationalization support, I had to install an additional dependency during
deployment, which wasn't discovered during development. Minor anecdote though.
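
To make the path sanitization case concrete, here is a minimal sketch of a
byte-based check that needs no locale or character database. It is not the
code from the deployment mentioned above: the function name, the allow-list
of bytes, and the 64-byte component limit are all assumptions chosen for
illustration.

```python
import re

# Hypothetical allow-list: each path component is 1 to 64 bytes drawn from
# ASCII letters, digits, '.', '_' and '-'. The pattern is a bytes pattern, so
# matching is done on byte values and does not depend on the current locale.
_COMPONENT = re.compile(rb"[A-Za-z0-9._-]{1,64}")


def is_safe_relative_path(path: bytes) -> bool:
    """Return True if `path` is a relative path made only of allowed components."""
    if not path or path.startswith(b"/"):
        return False  # reject empty and absolute paths
    for component in path.split(b"/"):
        if component in (b".", b".."):
            return False  # reject directory traversal
        if _COMPONENT.fullmatch(component) is None:
            return False  # reject empty components and any byte outside the list
    return True
```

With these assumptions, b"uploads/report-2025.txt" is accepted, while
b"../etc/passwd", b"/etc/passwd", b"a//b" and any path containing a byte in
the range 128-255 are rejected.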

POSIX already permits an implementation to support no locales other than
the C/POSIX locale, so a regex implementation that has no extension mechanism
whatsoever, on a system implementation that doesn't support defining additional
locales, is conforming. But here's the part that I'm not sure about:

I want to implement an ASCII-based regex that's simultaneously a byte-based
regex. POSIX doesn't require me to use the exact ASCII character set, so in
theory I have the freedom to call the byte values 128-255 [:nonchar:] or
[:nonascii:] if I see fit. But in this case, strictly speaking, I shouldn't
advertise the charset as ASCII in my environment. Yet programs that see ASCII
can assume some properties about the environment, and wouldn't such
assumptions in turn make them strictly non-portable?
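
As a rough illustration of the byte-based class idea, the sketch below builds
a class table over byte values 0-255. The ASCII ranges follow the usual
C-locale classification for a handful of classes only; "nonascii" is the
hypothetical extension floated above and is not a class defined by POSIX, and
the table and function names are assumptions as well.

```python
# Class membership over byte values. Only a few classes are shown.
CLASSES = {
    "upper": frozenset(range(0x41, 0x5B)),           # 'A'..'Z'
    "lower": frozenset(range(0x61, 0x7B)),           # 'a'..'z'
    "digit": frozenset(range(0x30, 0x3A)),           # '0'..'9'
    "space": frozenset(b" \t\n\v\f\r"),              # iterating bytes yields ints
    "cntrl": frozenset(range(0x00, 0x20)) | {0x7F},
    # Hypothetical extension class: every byte with the high bit set.
    "nonascii": frozenset(range(0x80, 0x100)),
}
CLASSES["alpha"] = CLASSES["upper"] | CLASSES["lower"]
CLASSES["alnum"] = CLASSES["alpha"] | CLASSES["digit"]


def byte_in_class(byte: int, name: str) -> bool:
    """Return True if the byte value belongs to the named class."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    return byte in CLASSES[name]


def leading_run(data: bytes, name: str) -> int:
    """Return the length of the longest prefix of `data` within the named class."""
    count = 0
    for byte in data:
        if not byte_in_class(byte, name):
            break
        count += 1
    return count
```

In this sketch, bytes 128-255 belong to "nonascii" and to no other class, so
a bracket expression such as [[:alpha:]] would never match them.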

How do you view these issues? Thanks for your opinion.
