Here is v2 of the patch. It no longer removes more code than it adds :-) but it should also work for gawk.
Since the support for case-insensitive multibyte matching involves some performance penalty (mostly because dfamust rarely finds a good string) I made it conditional on the GREP symbol. In the future a scheme for more feature bits can be added, but for now it's good.

Compared to v1, I added Debian's character-set range patch (patch 2) and fixed the warnings that Jim pointed out.

Paolo Bonzini (9):
  tests: add more UTF-8 test cases
  dfa: fix handling of ranges in multibyte character sets
  dfa: rewrite handling of multibyte case_fold lexing
  dfa: speed up handling of brackets
  dfa: optimize simple character sets under UTF-8 charsets
  dfa: cache MB_CUR_MAX for dfaexec
  dfa: run simple UTF-8 regexps as a single-byte character set
  grep: remove check_multibyte_string, fix non-UTF8 missed match
  grep: match multibyte charsets line-by-line when using -i

 .x-sc_cast_of_argument_to_free |    1 -
 .x-sc_space_tab                |    1 -
 NEWS                           |   15 +-
 src/dfa.c                      |  957 +++++++++++++++++++++-------------------
 src/dfa.h                      |    6 +
 src/grep.c                     |  108 ++---
 src/search.c                   |  244 ++++++-----
 tests/Makefile.am              |    7 +-
 tests/case-fold-backslash-w    |   14 +
 tests/case-fold-char-range     |   21 +
 tests/euc-mb                   |   23 +
 tests/foad1.sh                 |   10 +-
 tests/spencer1-locale          |   24 +
 tests/spencer1-locale.awk      |   30 ++
 14 files changed, 827 insertions(+), 634 deletions(-)
 delete mode 100644 .x-sc_cast_of_argument_to_free
 create mode 100755 tests/case-fold-backslash-w
 create mode 100644 tests/case-fold-char-range
 create mode 100644 tests/euc-mb
 create mode 100755 tests/spencer1-locale
 create mode 100644 tests/spencer1-locale.awk
