On 2023-04-06 06:39, demerphq wrote:

> Unicode specifies that \d match any digit
> in any script that it supports.

"Specifies" is too strong. The Unicode Regular Expressions technical standard (UTS#18) mentions \d only in Annex C[1], next to the word "digit" in a column labeled "Property" (even though \d is really syntax not a property). This is at best an informal recommendation, not a requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for illustration and that although it's similar to Perl's, the two syntax forms may not be exactly the same. So we can't look to UTS#18 for a definitive way out of the \d mess, as the Unicode folks specifically delegated matters to us.

Even ignoring the \d issue, the digit situation is messy. UTS#18 Annex C lists \p{gc=Decimal_Number} as the standard recommended syntax for decimal digits. However, PCRE2 does not support that syntax; it supports only the variant \p{Nd}, which UTS#18 also recommends. So PCRE2 already does not implement every recommended aspect of UTS#18 syntax. Nor does PCRE2 match Perl, which does support \p{gc=Decimal_Number}.

Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class, that's clearly enough for grep -P to conform to UTS#18 with respect to digits.
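To make the divergence concrete, here's a quick shell sketch (assuming a UTF-8 locale named en_US.utf8 and the grep -P and perl current as of this thread; U+0967 is DEVANAGARI DIGIT ONE, whose UTF-8 encoding is the bytes E0 A5 A7):

  $ printf '\340\245\247\n' >digits.txt    # octal for the UTF-8 bytes of U+0967
  $ LC_ALL=en_US.utf8 grep -P '\p{Nd}' digits.txt
  १
  $ LC_ALL=en_US.utf8 perl -CSD -ne 'print if /\p{gc=Decimal_Number}/' digits.txt
  १
  $ LC_ALL=en_US.utf8 grep -P '\p{gc=Decimal_Number}' digits.txt
      # fails: PCRE2 rejects the \p{gc=...} form at compile time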


> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.

Here's how GNU grep does it:

* RE semantics are specified via command-line options like -P.

* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.

* REs do not match encoding errors.
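
To illustrate all three points at once (a sketch assuming the en_US.utf8 locale exists; \303\251 is the UTF-8 encoding of "é", while a lone \351 byte is Latin-1 "é" and therefore a UTF-8 encoding error):

  $ printf 'caf\303\251\n' >ok.txt     # valid UTF-8 "café"
  $ printf 'caf\351\n' >bad.txt        # Latin-1 "café": malformed as UTF-8
  $ LC_ALL=en_US.utf8 grep 'caf.' ok.txt
  café
  $ LC_ALL=en_US.utf8 grep 'caf.' bad.txt    # no output: '.' never matches an encoding error
  $ LC_ALL=C grep 'caf.' bad.txt | od -c     # in the C locale every byte is a character
  0000000   c   a   f 351  \n
  0000005

The same no-match rule is intended to apply with -P as well.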


> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)

Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle UTFE, UTF-16LE vs. UTF-16BE, etc. If you're running legacy IBM mainframe or MS-Windows code, these encodings are obviously a big deal. However, there seems little reason to force their nontrivial hassles onto every GNU/Linux program that processes text. A few specialized apps like 'iconv' deal with the offbeat encodings, and that is probably a better approach all around.
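
For example, rather than teaching grep about UTF-16, one converts first (a sketch; u16.txt is a made-up file name):

  $ printf 'hello\n' | iconv -f UTF-8 -t UTF-16LE >u16.txt
  $ iconv -f UTF-16LE -t UTF-8 u16.txt | grep hello
  hello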


> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.

That is a use case, yes. It's what 'strings' and 'grep' do.
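
For instance (a sketch; mixed.bin is a made-up file interleaving valid UTF-8 lines with stray non-UTF-8 bytes):

  $ printf 'hello\n\377\376\nworld\n' >mixed.bin
  $ LC_ALL=en_US.utf8 grep -a world mixed.bin    # -a: search the binary-looking file as text
  world
  $ strings mixed.bin                            # prints just the printable runs
  hello
  world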


[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance



