> On May 29, 2025, at 20:37, EuAndreh <e...@euandre.org> wrote:
> 
> TBH I didn't understand.  Unix localization isn't relevant because language 
> models can do translations?  Even if the text is translated, don't you need a 
> system to pick translations?
> 

Yes, we need a system to pick translations, and environment 
variables (LANG, LC_COLLATE, LC_ALL, etc.) indicate which 
translation(s) to use. But the content of the translation 
can be supplied either by locale definitions or, if 
implementations choose to, by language models.
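
To make the "picking" part concrete, here is a minimal sketch of the
usual POSIX precedence among those variables (LC_ALL first, then the
category variable, then LANG; an empty value counts as unset). It is
in Python only for brevity. The function name `effective_locale` is
made up for illustration, and the "POSIX" fallback is an assumption,
since the default is implementation-defined.

```python
import os

def effective_locale(category, env=None):
    """Return the locale name selected for `category`, e.g. "LC_COLLATE".

    Hypothetical helper: it only mirrors the variable precedence and
    does not consult any installed locale definitions.
    """
    env = os.environ if env is None else env
    # Precedence: LC_ALL, then the category's own variable, then LANG.
    # A variable that is set to the empty string is treated as unset.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # Assumed fallback; the real default is implementation-defined.
    return "POSIX"
```

Whatever then supplies the translated text, this selection step stays
the same.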

> Isn't this unit a grapheme cluster?

Yes. When a program works with characters, it often 
works with "tokens" (e.g. words in natural language, 
keywords or numeric literals in programming languages), 
so the tokenization of characters is more important 
to programs; a grapheme cluster is really just 
a "hallucination" of humans.

> Go for UTF-8, which is compatible with ASCII, and add 128-255 to that custom 
> character class.  If you ever want to expand portability, UTF-8 will be the 
> way to go.

No, not so quick. UTF-8 is a *variable-length* encoding of
characters, and POSIX specifies that quantifiers count
*characters*, NOT *bytes* (remember, I said I want to
implement what is simultaneously a byte-based regex).
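
A small illustration of the mismatch, using Python's `re` module
(which is not a POSIX regex engine, so take it only as an analogy):
the same quantifier gives different answers depending on whether the
subject is a string of characters or a string of bytes.

```python
import re

text = "é"                   # one character, U+00E9
raw = text.encode("utf-8")   # two bytes: 0xC3 0xA9

# On a character string, the quantifier counts characters.
char_match = re.fullmatch(r".{1}", text)

# On a byte string, the same quantifier counts bytes.
byte_one = re.fullmatch(rb".{1}", raw)
byte_two = re.fullmatch(rb".{2}", raw)

print(char_match is not None)  # one character matches .{1}
print(byte_one is None)        # two bytes do not match .{1}
print(byte_two is not None)    # two bytes match .{2}
```

All three lines print True, so a byte-based engine fed UTF-8 input
would count "é" as two repetitions where a character-based engine
counts one.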

UTF-8 is good for handling human text, whereas its ASCII
subset/origin is good for deterministic tokenization of
computer-generated strings.
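
As a sketch of what I mean by deterministic tokenization (Python for
brevity; the record format and field names are invented for the
example): every byte of a multi-byte UTF-8 sequence is 0x80 or above,
so splitting on ASCII delimiters at the byte level never cuts a
character in half, and no character properties are needed.

```python
# Hypothetical machine-generated record: ASCII delimiters, UTF-8 payload.
line = "name=Niño;city=北京".encode("utf-8")

# Tokenize purely on bytes, using only the ASCII delimiters ';' and '='.
fields = [part.split(b"=", 1) for part in line.split(b";")]

# Decode only at the end, when the text is shown to a human.
decoded = {key.decode("utf-8"): value.decode("utf-8") for key, value in fields}

# No byte of a non-ASCII character falls in the ASCII range.
high_only = all(b >= 0x80 for b in "ñ北京".encode("utf-8"))
```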

One can of course define a character as a single code point,
but that'll be a lot more complex, and some features will
depend on knowledge of specific properties of characters, 
which, as I said, would require a dependency on a database.
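
To show where the database dependency comes from, here is a rough
sketch (Python for brevity; both function names are hypothetical).
With a byte as the character, an alphabetic class is two range
checks. With anything wider, membership has to be looked up, here via
Python's `unicodedata` module. Treating general category "L*" as
"alphabetic" is my simplification, not what POSIX or any particular
locale defines.

```python
import unicodedata

def is_alpha_ascii(byte):
    # Byte as character: two range checks, no tables required.
    return 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A

def is_alpha_unicode(ch):
    # Wider character: the answer comes from the Unicode Character
    # Database (assumption: any Letter category counts as alphabetic).
    return unicodedata.category(ch).startswith("L")
```

For example, `is_alpha_ascii(0xC3)` is false (0xC3 is only the first
byte of "é" in UTF-8), while `is_alpha_unicode("é")` is true, and only
the second answer needed the database.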

So the whole point of my original mail on this thread is
to solicit advice on what to be cautious about when I
define a character as a byte in my regex implementation.

