I'd like to ask whether the following premise makes the following implementation strategy valid. I'm not sure whether this is on topic, though.
Localization in Unix was intended to sell the system to non-English-speaking customers, but nowadays its relevance is decreasing due to the development of language models of deployable scale and improved translation algorithms. Although their accuracy is debated, they're sufficient as a built-in first resort, since users would purchase more professional translation software or services for work. Internationalized regex is supposedly a subsidiary tool to localization for text processing, but for a regex engine to be really internationalized, I think a character database model is needed, which is not easy, as the true boundary of a character is not always clear in every culture. I suppose readers will expect Perl to be mentioned, so yes, a large body of text-processing tools is written in Perl, owing to its more versatile regex and programming language syntax, as well as its diverse ecosystem.

Regex in Unix really is mostly good for system administration, especially for tasks that are meant to be automated, such as log analysis and incident reports. Configuration editing and other tasks that require human decisions cannot be automated, but they can be greatly augmented when a useful tool such as regex is available to the user. I personally found another use of regex, one where localization prevented me from doing what I needed. In web back-end programming, there's a need for "path sanitization" when storing and retrieving files, to prevent a malicious client from using a crafted path to overwrite or access restricted data. Because the regex engine I used at the time was bundled with internationalization support, I had to install an additional dependency during deployment, which wasn't discovered during development. A minor anecdote, though.
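To illustrate the path sanitization case, here is a minimal sketch of an allow-list check that needs no locale or Unicode support at all. It uses Python's `re` on bytes purely for illustration (it is not the engine from the anecdote), and the allowed character set and the name `is_safe_relative_path` are my own assumptions, not a standard rule.

```python
import re

# Hypothetical allow-list: each path component starts with an ASCII letter,
# digit, '_' or '-', then continues with those characters or '.'.
# Starting with '.' is forbidden, so "." and ".." can never be components.
COMPONENT = rb"[A-Za-z0-9_-][A-Za-z0-9._-]*"

# One or more components separated by single '/' characters.
# No leading '/', so absolute paths are rejected as well.
SAFE_PATH = re.compile(COMPONENT + rb"(?:/" + COMPONENT + rb")*")


def is_safe_relative_path(raw: bytes) -> bool:
    """Return True only if every byte of `raw` fits the allow-list above."""
    # fullmatch() requires the whole input to match, so trailing newlines,
    # NUL bytes and bytes in the range 128-255 are all rejected.
    return SAFE_PATH.fullmatch(raw) is not None
```

For example, `is_safe_relative_path(b"uploads/report-2024.txt")` is accepted, while `b"../etc/passwd"`, `b"/etc/passwd"` and any path containing a byte above 127 are rejected. This only checks the shape of the path; a real deployment would presumably still want to resolve the result against a base directory.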
POSIX already permits an implementation to support no locales other than the C/POSIX locale, so a regex implementation that has no extension mechanism whatsoever, on a system that doesn't support defining additional locales, is conforming. But here's the part I'm not sure about: I want to implement an ASCII-based regex that is simultaneously a byte-based regex. POSIX doesn't require me to use the exact ASCII character set, so in theory I have the freedom to call the byte values 128-255 [:nonchar:] or [:nonascii:] if I see fit. But in that case, strictly speaking, I shouldn't advertise the charset as ASCII in my environment. Programs that see ASCII can assume some properties about the environment, but wouldn't such assumptions in turn make them strictly non-portable? How do you view these issues? Thanks for your opinions.
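To make the byte-based idea above concrete, here is a rough sketch of what I mean by treating bytes 128-255 as their own class. It is written with Python's `re` on bytes only as a stand-in, since that module is not a POSIX regex implementation. The class name `[:nonascii:]` is my own hypothetical extension, and `compile_bytes` is an illustrative helper that simply rewrites that name into a byte range before compiling.

```python
import re

# Hypothetical stand-in for a [:nonascii:] class: the byte values 128-255.
NONASCII_RANGE = rb"\x80-\xff"


def compile_bytes(pattern: bytes):
    """Compile a bytes pattern, expanding the hypothetical [:nonascii:] name.

    The name is only meaningful inside a bracket expression, e.g.
    b"[[:nonascii:]]" becomes b"[\\x80-\\xff]".
    """
    return re.compile(pattern.replace(b"[:nonascii:]", NONASCII_RANGE))


# Runs of high bytes are found without any notion of encoding or locale.
high_bytes = compile_bytes(rb"[[:nonascii:]]+")
runs = high_bytes.findall(b"caf\xc3\xa9 na\xefve")
```

With bytes patterns and no `re.LOCALE` flag, Python's `\w` already matches only ASCII letters, digits and the underscore, so a high byte such as `b"\xe9"` is never classified as a word character. That is roughly the behaviour I'd want from the engine: every byte is matchable, but only the 0-127 range carries character properties.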