On 2023-04-06 06:39, demerphq wrote:

> Unicode specifies that \d match any digit
> in any script that it supports.

"Specifies" is too strong. The Unicode Regular Expressions technical standard (UTS#18) mentions \d only in Annex C[1], next to the word "digit" in a column labeled "Property" (even though \d is really syntax not a property). This is at best an informal recommendation, not a requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for illustration and that although it's similar to Perl's, the two syntax forms may not be exactly the same. So we can't look to UTS#18 for a definitive way out of the \d mess, as the Unicode folks specifically delegated matters to us.

Even ignoring the \d issue, the digit situation is messy. UTS#18 Annex C lists \p{gc=Decimal_Number} as the standard recommended syntax for decimal digits. However, PCRE2 does not support that syntax; it supports only the variant \p{Nd}, which UTS#18 also recommends. So PCRE2 already does not implement every recommended aspect of UTS#18 syntax. Nor does PCRE2 match Perl, which does support \p{gc=Decimal_Number}.

Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class, that's clearly enough for grep -P to conform to UTS#18 with respect to digits.
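To make the divergence concrete, here's a quick shell sketch (assuming a UTF-8 locale named en_US.utf8 and the grep -P and perl current as of this thread; U+0967 is DEVANAGARI DIGIT ONE, whose UTF-8 encoding is the bytes E0 A5 A7):

  $ printf '\340\245\247\n' >digits.txt    # octal for the UTF-8 bytes of U+0967
  $ LC_ALL=en_US.utf8 grep -P '\p{Nd}' digits.txt
  १
  $ LC_ALL=en_US.utf8 perl -CSD -ne 'print if /\p{gc=Decimal_Number}/' digits.txt
  १
  $ LC_ALL=en_US.utf8 grep -P '\p{gc=Decimal_Number}' digits.txt
      # fails: PCRE2 rejects the \p{gc=...} form at compile time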


> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.

Here's how GNU grep does it:

* RE semantics are specified via command-line options like -P.

* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.

* REs do not match encoding errors.
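
To illustrate all three points at once (a sketch assuming the en_US.utf8 locale exists; \303\251 is the UTF-8 encoding of "é", while a lone \351 byte is Latin-1 "é" and therefore a UTF-8 encoding error):

  $ printf 'caf\303\251\n' >ok.txt     # valid UTF-8 "café"
  $ printf 'caf\351\n' >bad.txt        # Latin-1 "café": malformed as UTF-8
  $ LC_ALL=en_US.utf8 grep 'caf.' ok.txt
  café
  $ LC_ALL=en_US.utf8 grep 'caf.' bad.txt    # no output: '.' never matches an encoding error
  $ LC_ALL=C grep 'caf.' bad.txt | od -c     # in the C locale every byte is a character
  0000000   c   a   f 351  \n
  0000005

The same no-match rule is intended to apply with -P as well.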


> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)

Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle UTFE, UTF-16LE vs. UTF-16BE, etc. If you're running legacy IBM mainframe or MS-Windows code, these encodings are obviously a big deal. However, there seems little reason to force their nontrivial hassles onto every GNU/Linux program that processes text. A few specialized apps like 'iconv' deal with the offbeat encodings, and that is probably a better approach all around.
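
For example, rather than teaching grep about UTF-16, one converts first (a sketch; u16.txt is a made-up file name):

  $ printf 'hello\n' | iconv -f UTF-8 -t UTF-16LE >u16.txt
  $ iconv -f UTF-16LE -t UTF-8 u16.txt | grep hello
  hello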


> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.

That is a use case, yes. It's what 'strings' and 'grep' do.
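
For instance (a sketch; mixed.bin is a made-up file interleaving valid UTF-8 lines with stray non-UTF-8 bytes):

  $ printf 'hello\n\377\376\nworld\n' >mixed.bin
  $ LC_ALL=en_US.utf8 grep -a world mixed.bin    # -a: search the binary-looking file as text
  world
  $ strings mixed.bin                            # prints just the printable runs
  hello
  world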


[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance



