2014-02-24 08:53:17 -0800, Jim Meyering:
[...]
> This is pretty serious:
>
>   $ printf 'p\xc3\xa8re\n' |LC_ALL=en_US.utf8 grep -w p
>   père

It gets more complicated with combining characters:

  $ printf 'pe\314\200re\n' | grep -w pe
  père

You can't expect \w to match U+0300 alone. You can't expect \w to
match two characters (e with U+0300) either. It feels wrong that
grep finds a word boundary inside a single grapheme though (between
e and its grave accent).

I suppose one way to address the problem would be an option that
turns anything that matches a single character (., [xy], \w,
\s...) into something that matches a grapheme, or if not, maybe a
"combining character sequence". See
http://www.unicode.org/faq/char_combmark.html for more details.
That's not a grep-only problem though.

I suppose it gets even more complicated with non-Latin alphabets or
non-alphabetic languages.

\w, -w, \b, \<, \> are not "standard" features, so GNU may decide
what they want to do with them. Restricting them to ASCII
a-zA-Z0-9_ (which is not even the set of word constituents in
English, but appears to match C identifiers, which is probably what
it was designed for in the first place) is as good a choice as any,
I would say. Changing it might break things. Adding other ways to
match Unicode character properties (like PCRE's \p{...}) may be a
better approach.

-- 
Stephane
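
To illustrate that this is not specific to grep, here is a minimal sketch in Python (not grep's implementation). It shows Python's re module finding the same word boundary between "e" and U+0300, and a rough way to group a base character with the combining marks that follow it. The helper name combining_sequences is hypothetical, and the grouping rule (a non-zero canonical combining class) is only an approximation of a "combining character sequence"; it is not the full grapheme cluster algorithm.

```python
import re
import unicodedata


def combining_sequences(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration: a character is treated as a
    combining mark when unicodedata.combining() returns a non-zero class.
    """
    chunks = []
    for ch in text:
        if chunks and unicodedata.combining(ch):
            chunks[-1] += ch      # attach the mark to the preceding base
        else:
            chunks.append(ch)     # start a new sequence
    return chunks


decomposed = "pe\u0300re"  # 'e' followed by U+0300 COMBINING GRAVE ACCENT

# Python's \w does not match U+0300 on its own...
print(re.fullmatch(r"\w", "\u0300") is None)           # True

# ...so \b reports a word boundary between 'e' and its accent.
print(re.search(r"\bpe\b", decomposed) is not None)    # True

# Grouping by combining sequence yields 4 units instead of 5 code points.
print(len(decomposed), len(combining_sequences(decomposed)))  # 5 4
```

A matcher working on such sequences rather than on individual code points would at least not see a boundary inside "e + U+0300", which is the kind of option suggested above.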
