bug#16867: [bug #37600] grep -w cuts words on non-ascii

Stephane Chazelas Mon, 24 Feb 2014 13:40:32 -0800

2014-02-24 08:53:17 -0800, Jim Meyering:
[...]
> This is pretty serious:
> 
>     $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
>     père


It gets more complicated with combining characters:

$ printf 'pe\314\200re\n' | grep -w pe
père

You can't expect \w to match U+0300 alone. You can't expect \w to
match two characters (e with U+0300) either.

It feels wrong that grep finds a word boundary inside a single
grapheme though (between e and its grave accent).
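
The same effect can be reproduced outside grep. The sketch below
uses Python's re module purely as a stand-in (it is not grep's
matcher, and the variable names are made up for illustration): \w
does not match U+0300 on its own, so \b reports a boundary between
the e and its accent, whereas the precomposed form of the same
text has no such boundary.

```python
import re
import unicodedata

decomposed = "pe\u0300re"  # 'e' followed by U+0300 COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # 'e' + U+0300 becomes U+00E8

# U+0300 is a nonspacing mark (general category Mn), not a letter,
# so Python's \w does not match it when it stands alone.
mark_category = unicodedata.category("\u0300")
mark_is_word_char = re.fullmatch(r"\w", "\u0300") is not None

# \b therefore sees a word boundary between 'e' and its accent.
match_decomposed = re.search(r"\bpe\b", decomposed) is not None
match_composed = re.search(r"\bpe\b", composed) is not None

print(mark_category, mark_is_word_char, match_decomposed, match_composed)
# prints: Mn False True False
```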

I suppose one way to address the problem would be an option that
turns anything that matches a single character (., [xy], \w,
\s...) into something that matches a grapheme, or if not, maybe a
"combining character sequence".

See http://www.unicode.org/faq/char_combmark.html for more details.
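
To make the idea concrete, here is a minimal sketch in Python of
what matching on combining character sequences could look like.
Everything in it is an assumption made for illustration: the
function names are hypothetical, a "sequence" is simplified to a
base character plus any following characters of general category
M*, and this is neither full grapheme cluster segmentation nor
anything grep implements.

```python
import unicodedata


def combining_sequences(text):
    """Split text into base characters plus any following combining marks."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def is_word_unit(cluster):
    """Treat a cluster as a word constituent if its base character is one."""
    base = cluster[0]
    return base.isalnum() or base == "_"


def has_whole_word(text, word):
    """Hypothetical -w-like test that never splits a combining sequence."""
    clusters = combining_sequences(text)
    target = combining_sequences(word)
    n = len(target)
    for i in range(len(clusters) - n + 1):
        if clusters[i:i + n] != target:
            continue
        before_ok = i == 0 or not is_word_unit(clusters[i - 1])
        after_ok = i + n == len(clusters) or not is_word_unit(clusters[i + n])
        if before_ok and after_ok:
            return True
    return False
```

With this, has_whole_word("pe\u0300re", "pe") is false, because
the e and its accent are compared as one unit, while
has_whole_word("pe re", "pe") is still true.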

That's not a grep-only problem though.

I suppose it gets even more complicated with non-Latin alphabets
or non-alphabetic languages.

\w, -w, \b, \<, \> are not "standard" features, so GNU may
decide what it wants to do with them. Restricting them to ASCII
a-zA-Z0-9_ (which are not even the word constituents of English,
but appear to match C identifiers, which is probably what they
were designed for in the first place) is as good a choice as any,
I would say.
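
For comparison, Python's re module offers both behaviours, which
makes the trade-off easy to see. This is only an illustration of
the two definitions of \w, not a statement about how grep is or
should be implemented; the variable names are made up.

```python
import re

ASCII_WORD = re.compile(r"\bp\b", re.ASCII)  # \w is [a-zA-Z0-9_] only
UNICODE_WORD = re.compile(r"\bp\b")          # \w follows Unicode alphanumerics

line = "p\u00e8re"  # "père" with a precomposed U+00E8

# With ASCII-only \w, the accented letter is a non-word character,
# so "p" looks like a whole word, as in the report quoted above.
ascii_hit = ASCII_WORD.search(line) is not None
unicode_hit = UNICODE_WORD.search(line) is not None

print(ascii_hit, unicode_hit)  # prints: True False
```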

Changing it might break things. Adding other ways to match
Unicode character properties (like PCRE's \p{...}) may be a
better approach.
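
As a rough idea of what property-based matching buys you, the
sketch below emulates something like \p{L} or \p{M} in Python by
looking at the Unicode general category of each character. Python's
re module has no \p{...} syntax, so the helper names here are
hypothetical stand-ins, and only general-category prefixes are
handled (no scripts or other properties).

```python
import unicodedata


def matches_property(ch, prop):
    # Rough stand-in for \p{prop}: prop is a general-category prefix
    # such as "L" (letters), "M" (marks) or "Nd" (decimal digits).
    return unicodedata.category(ch).startswith(prop)


def find_runs(text, prop):
    """Return maximal runs of characters whose category starts with prop."""
    runs = []
    current = ""
    for ch in text:
        if matches_property(ch, prop):
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs
```

On "pe\u0300re 42", find_runs(..., "L") gives the two letter runs
"pe" and "re", and find_runs(..., "M") gives the lone U+0300, so a
caller can decide explicitly whether marks count as part of a word.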

-- 
Stephane
