Hello,
I was working on a couple POSIX regular expressions to search for and
validate IPv4 and IPv6 addresses with optional CIDR blocks, and
encountered some strange behaviour from the base system grep.
I wanted to validate my regex against a list of every valid IPv4
address, so I generated a list with a zsh 1 liner:
for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
tr '[:space:]' '\n' > IPv4.txt
My intentions were to test the regex by running it with 'grep -c' to
confirm there was indeed 2^32 addresses matched, and I also wanted to
benchmark and compare performance between BSD grep, GNU grep and
ripgrep. The command I used:
grep -Eoc
"((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
My findings were surprising. Both GNU grep and ripgrep were able get
through the file in roughly 10 and 20 minutes respectively, whereas the
base system grep took over 20 hours! What interested me the most was
that the base system grep when run with '-c' returned '0' for match
count. It seems that 'grep -c' will have its counter overflow if there
are more than 2^32-1 matches (4294967295) and then the counter will
start counting from zero again for further matches.
ryzen$ time zcat IPv4.txt.gz | grep -Eoc "((25[0-5]|(2[0-4]|1{0,1}...
0
1222m09.32s real 1224m28.02s user 1m16.17s system
ryzen$ time zcat allip.txt.gz | ggrep -Eoc "((25[0-5]|(2[0-4]|1{0,1}...
4294967296
10m00.38s real 11m40.57s user 0m30.55s system
ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
4294967296
21m06.36s real 27m06.04s user 0m50.08s system
# See the counter overflow/reset:
jot 4294967350 | grep -c "^[[:digit:]]"
54
All testing was done on a Ryzen desktop machine running 6.7 stable.
The grep counting bug can be reproduced with this command:
jot 4294967296 | nice grep -c "^[[:digit:]]"
Regards,
Jordan