() Mark H Weaver <m...@netris.org> () Thu, 17 Mar 2011 13:58:42 -0400
   * regexp search: The search itself can be implemented bytewise,
     exactly as if it were a fixed-width encoding. Compiling the regexp
     can _almost_ be implemented as if the UTF-8-encoded regexp was in
     a fixed-width encoding, with just one added complication: a
     multibyte character followed by `*', `?', etc. must be compiled in
     such a way that the suffix operator applies to the whole
     character, and not just its final byte. (In practice, it's
     probably more straightforward to handle compiling somewhat
     differently than outlined here, but you get the idea.)

In unibyte land, "." matches a byte. OK.

In multibyte land done "bytewise", "." matches ____________.
(What goes in the blank?)
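
To make both points concrete, here is a rough sketch in Python, using its `re` module on byte strings as a stand-in for a bytewise matcher. The helper name `compile_bytewise` and the `UTF8_CHAR` pattern are made up for illustration and are not from any existing implementation. The sketch handles only literals, ".", and the operators `*`, `?`, `+`, `|`, `(`, `)`, and its notion of "one character" is simplified (it does not reject overlong forms or surrogates). It shows one possible way to fill in the blank, not necessarily the one either poster has in mind.

```python
import re

# Assumption: "one character" means one UTF-8 sequence of 1 to 4 bytes.
# Simplified: overlong encodings and surrogates are not rejected.
UTF8_CHAR = (
    rb"(?:[\x00-\x7f]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3})"
)

OPERATORS = "*?+|()"


def compile_bytewise(pattern):
    """Translate a small regexp subset into a byte-level regexp.

    Hypothetical helper: a multibyte literal is wrapped in a group so
    that a following suffix operator applies to the whole character,
    and "." is expanded to a whole UTF-8 sequence instead of one byte.
    """
    parts = []
    for ch in pattern:
        if ch == ".":
            parts.append(UTF8_CHAR)
        elif ch in OPERATORS:
            parts.append(ch.encode("ascii"))
        else:
            encoded = ch.encode("utf-8")
            literal = re.escape(encoded)
            if len(encoded) > 1:
                # Group the bytes so `*', `?' etc. cover all of them.
                literal = b"(?:" + literal + b")"
            parts.append(literal)
    return re.compile(b"".join(parts))


text = "ééé".encode("utf-8")

# Naive compile: `*' binds only to the final byte of "é".
naive_star = re.compile("é*".encode("utf-8"))
print(naive_star.fullmatch(text))                 # None

# Grouped compile: `*' binds to the whole character.
print(compile_bytewise("é*").fullmatch(text) is not None)   # True

# Naive ".": matches a single byte, so it cannot span "é".
naive_dot = re.compile(b"a.b")
print(naive_dot.fullmatch("aéb".encode("utf-8")))  # None

# Expanded ".": matches one whole UTF-8 sequence.
print(compile_bytewise("a.b").fullmatch("aéb".encode("utf-8")) is not None)  # True
```

Note that once "." is expanded this way, the compiled form is no longer a simple byte-for-byte image of the source regexp, which is the complication the question above is pointing at.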