Eric Blake wrote: > It is not quite as fast as optimized assembly memchr for a single byte > search, but when searching for two bytes in parallel, it is hands down > faster than two sequential memchr() operations or any naive byte-by-byte > comparisons.
Thanks, I look at the performance and compare them. memchr2() is the fastest for case-insensitive except for large d Solaris or HP-UX. However, kwset doesn't know that `trans' doesn't have counterparts more than one for any character. So we need to check it before run memchr2(). << Linux 86-64 >> memchr2 : real 0.55 user 0.07 sys 0.48 memchr_trans : real 0.71 user 0.14 sys 0.57 my patch small d: real 1.06 user 0.57 sys 0.48 my patch large d: real 0.54 user 0.01 sys 0.52 DFA : real 1.45 user 0.91 sys 0.53 << Solaris 10 >> memchr2 : real 3.17 user 2.45 sys 0.71 memchr_trans : real 4.16 user 3.45 sys 0.70 my patch small d: real 6.43 user 5.71 sys 0.71 my patch large d: real 1.29 user 0.57 sys 0.71 DFA : real 11.08 user 10.35 sys 0.71 << HP-UX ia64 >> memchr2 : real 0.9 user 0.6 sys 0.2 memchr_trans : real 2.5 user 2.3 sys 0.2 my patch small d: real 2.1 user 1.9 sys 0.2 my patch large d: real 0.3 user 0.1 sys 0.2 DFA : real 3.2 user 3.0 sys 0.2 For small d: $ env LANG=C time -p src/grep -i jk ../k For large d: $ env LANG=C time -p src/grep -i kkkkkkkkkkkkkkkkkkkk ../k For DFA: $ env LANG=C time -p src/grep -i 'k\|l' ../k By the way, 0001-grep-speed-up-by-replacing-incr-to-add-in-x86-and-x8.patch is effective still, even when we use memchr2(). It's useful for below, which doesn't reach at memchr2() to run alternately with delta1 searching and delta2 searching. $ yes kjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkjkj | head -10000000 >k $ env LANG=C src/grep -i jj k before: real 0.85 user 0.64 sys 0.21 after : real 0.54 user 0.29 sys 0.25 Norihiro
