On Sat, Feb 20, 2016 at 8:19 PM, Jim Meyering <j...@meyering.net> wrote:
> On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova <skvad...@gmail.com> wrote:
>> I've explored the following case:
>>
>> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
>> 6
...
>> The bug also present with PCRE engine:
>>
>> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>> 6
>> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>> 6
>
> Thank you for the analysis and the report.
> I have fixed the regex-oriented problem with the attached
> patch, but not yet the case using -P -z (PCRE + --null-data):

The -Pz/PCRE problem is more fundamental, and strikes even with LC_ALL=C.
This shows that with -Pz, anchors still wrongly match at newlines,
rather than at \0 bytes:

  $ printf '\0a\nb\0' | LC_ALL=C src/grep -Plz '^a'
  [Exit 1]
  $ printf '\0a\nb\0' | LC_ALL=C src/grep -Plz '^b'
  (standard input)

Fixing this is on PCRE's maint/README wish list with this item:

  . Line endings:

    * Option to use NUL as a line terminator in subject strings. This could
      now be done relatively easily since the extension to support LF, CR,
      and CRLF.
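
For reference, here is a rough Python sketch of the semantics one would expect from -z: the input is split into NUL-terminated records, and '^' anchors only at the start of each record, never after an embedded newline. This is illustration only, not how grep or PCRE implement matching; the helper name records_matching is made up for this example, and Python's re module stands in for the real regex engines.

```python
import re
from typing import List


def records_matching(data: bytes, pattern: bytes) -> List[bytes]:
    """Return the NUL-terminated records of `data` in which `pattern` matches.

    Hypothetical helper: it models the expected -z behavior, where NUL
    (not newline) is the record terminator.
    """
    records = data.split(b"\0")
    # A trailing NUL terminates the last record; it does not start a new one.
    if records and records[-1] == b"":
        records.pop()
    # No re.MULTILINE: '^' matches only at the very start of each record,
    # so an embedded newline does not create a new anchor position.
    rx = re.compile(pattern)
    return [r for r in records if rx.search(r)]


data = b"\0a\nb\0"
print(records_matching(data, rb"^a"))  # the record b"a\nb" starts with "a"
print(records_matching(data, rb"^b"))  # "b" only follows a newline: no match
```

Under these assumptions '^a' selects the second record and '^b' selects nothing, which is the opposite of what the two -Plz commands above report.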