Re: case-insensitive grep with accented letters

Geoff Steckel Sun, 01 Jun 2025 12:48:56 -0700

On 6/1/25 11:10 AM, Ingo Schwarze wrote:

Hello,


Stuart Henderson wrote on Sat, May 31, 2025 at 10:45:17AM -0000:

On 2025-05-31, rsyk...@disroot.org <rsyk...@disroot.org> wrote:

I was surprised to learn that 'grep -i' does not
really work for accented letters

OpenBSD base doesn't support LC_COLLATE.

Everything that sthen@ said is correct.

Let me add that supporting LC_COLLATE is not even a long-term goal.

LC_COLLATE is among the most complicated aspects of locales.
The collation order depends on the language, and for some
languages, there is even more than one collation order that
is commonly used.  We certainly do not want to poison our libc
with that amount of complexity.

That said, implementing 'grep -i' for non-ASCII characters does not
strictly require LC_COLLATE support (unlike sort(1), for example,
which might).  What *is* needed is working towlower(3) support
in libc, which is controlled by LC_CTYPE, and which we do have (and
it is reasonably up to date because our libc Unicode support follows
Perl, currently at Unicode Version 15.0.0, released in September
2022).

For example, towlower(U+017D) works for me and returns U+017E.
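
A minimal sketch of how one might check this on a given system.  The
helper names (select_utf8_ctype, lower_codepoint) and the list of locale
names are assumptions made for illustration and do not come from any
OpenBSD source; the sketch also assumes that wchar_t holds Unicode code
points.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wctype.h>

/* Candidate locale names are an assumption; availability varies by system. */
static const char *const utf8_locales[] = { "en_US.UTF-8", "C.UTF-8" };

/* Try to select a UTF-8 LC_CTYPE; return 1 on success, 0 otherwise. */
int
select_utf8_ctype(void)
{
	size_t i;

	for (i = 0; i < sizeof(utf8_locales) / sizeof(utf8_locales[0]); i++)
		if (setlocale(LC_CTYPE, utf8_locales[i]) != NULL)
			return 1;
	return 0;
}

/* Lower-case one code point under the current LC_CTYPE. */
wint_t
lower_codepoint(wint_t wc)
{
	assert(wc != WEOF);
	return towlower(wc);
}
```

After a successful select_utf8_ctype(), lower_codepoint(0x017D) should
yield 0x017E on a libc with working towlower(3).  Without a UTF-8
LC_CTYPE, the result for non-ASCII input depends on the implementation.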

What you want requires wide-character support in both regexec(3)
and grep(1), such that (1) U+017D can be recognized as a character
rather than being treated as two bytes, (2) towlower(3) can
transform it to U+017E, and (3) the result can then be compared
to the command line argument in a wchar_t to wchar_t comparison.
These are multiple tasks of significant difficulty and size.
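
The three steps can be sketched for plain string equality, which is far
simpler than what regexec(3) would need.  The function name mbs_caseeq
is hypothetical, and the caller is assumed to have selected a suitable
LC_CTYPE with setlocale(3) beforehand.

```c
#include <assert.h>
#include <locale.h>	/* for the caller's setlocale(LC_CTYPE, ...) */
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Compare two multibyte strings, ignoring case.
 * Returns 1 if equal, 0 if different, -1 on invalid input.
 */
int
mbs_caseeq(const char *a, const char *b)
{
	mbstate_t sa, sb;
	wchar_t wa, wb;
	size_t la, lb, ra, rb;

	assert(a != NULL && b != NULL);
	memset(&sa, 0, sizeof(sa));
	memset(&sb, 0, sizeof(sb));
	la = strlen(a) + 1;	/* include the terminating NUL */
	lb = strlen(b) + 1;

	for (;;) {
		/* (1) decode one whole character, not one byte */
		ra = mbrtowc(&wa, a, la, &sa);
		rb = mbrtowc(&wb, b, lb, &sb);
		if (ra == (size_t)-1 || ra == (size_t)-2 ||
		    rb == (size_t)-1 || rb == (size_t)-2)
			return -1;
		/* (2) fold with towlower, (3) compare wchar_t to wchar_t */
		if (towlower((wint_t)wa) != towlower((wint_t)wb))
			return 0;
		if (ra == 0)	/* both strings ended at the same point */
			return 1;
		a += ra;
		la -= ra;
		b += rb;
		lb -= rb;
	}
}
```

This only covers literal comparison.  Doing the same inside a regular
expression matcher is where the real difficulty lies.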

Maybe, as a partial solution, it would even be possible to improve
*only* grep(1) while leaving the (even more scary) regexec(3)
alone, i.e. have grep(1), when called with -i, convert both
the command line arguments and every input line to lower case
with towlower(3), then pass both to the narrow-character regexec(3),
which should work for your use case.  It would not work for other
use cases though; for example, /./ still wouldn't match an accented
character.
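
A rough sketch of that partial approach, not a patch against the real
grep(1): the helpers mbs_tolower and match_folded are hypothetical names,
and the caller is assumed to have set LC_CTYPE.  Note that blindly
lower-casing the pattern is itself an assumption that breaks for
patterns where case carries meaning, for example [[:upper:]] could no
longer match anything in the folded line.

```c
#include <assert.h>
#include <locale.h>	/* for the caller's setlocale(LC_CTYPE, ...) */
#include <regex.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Return a newly allocated lower-cased copy of the multibyte string s,
 * or NULL on invalid input or allocation failure.
 */
char *
mbs_tolower(const char *s)
{
	wchar_t *w;
	char *out;
	size_t n, i, len;

	assert(s != NULL);
	n = mbstowcs(NULL, s, 0);	/* count wide characters */
	if (n == (size_t)-1)
		return NULL;
	w = calloc(n + 1, sizeof(*w));
	if (w == NULL)
		return NULL;
	mbstowcs(w, s, n + 1);
	for (i = 0; i < n; i++)
		w[i] = (wchar_t)towlower((wint_t)w[i]);
	len = wcstombs(NULL, w, 0);	/* count bytes needed */
	if (len == (size_t)-1) {
		free(w);
		return NULL;
	}
	out = malloc(len + 1);
	if (out != NULL)
		wcstombs(out, w, len + 1);
	free(w);
	return out;
}

/*
 * Fold pattern and line, then use the ordinary narrow regexec(3).
 * Returns 1 on match, 0 on no match, -1 on error.
 */
int
match_folded(const char *pattern, const char *line)
{
	regex_t re;
	char *p, *l;
	int rc = -1;

	assert(pattern != NULL && line != NULL);
	p = mbs_tolower(pattern);
	l = mbs_tolower(line);
	if (p != NULL && l != NULL && regcomp(&re, p, REG_NOSUB) == 0) {
		rc = (regexec(&re, l, 0, NULL, 0) == 0);
		regfree(&re);
	}
	free(p);
	free(l);
	return rc;
}
```

Because regexec(3) still sees bytes here, the limitation described above
remains: a pattern such as /./ would match a single byte of an accented
character rather than the whole character.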

Yours,
   Ingo

(Obviously) mapping Unicode to ASCII is complex and a pain.
The NIH National Library of Medicine has Java tools whose distribution
conditions -might- be acceptable:
https://lhncbc.nlm.nih.gov/LSG/Projects/lvg/current/web/termsAndConditions.html
The documentation describes the various mappings one might use.

geoff steckel
