From: David Starner <[EMAIL PROTECTED]>
> The dict standard dictates that all data crossing the wire shall be
> in UTF-8. Unfortunately, the reference implementation doesn't even
> try to get it right. I was discussing the issue with a maintainer of
> a Russian dictionary for dict, and part of the problem was that there
> was no UTF-8 regex engine. Does anyone know of a UTF-8 regex engine,
> preferably one that can be plugged into a GPL'ed C program easily?

What flavor of "regex"?

The regular expression language specified for XML Schema has the advantage that it permits very fast implementations. One such implementation is rx-xml, a GPL'ed DFA-based matcher that handles UTF-8 and all byte-order variations of UTF-16 (i.e., be, le, and native).

I haven't looked at the source code for the system you're working on, or at the protocol specs. Rx-xml has features that might or might not be especially useful, depending on your needs. For example, it can match data spread over multiple buffers (as in successive bursts from a network connection). It has support for interrupting long-running matches. It is highly tunable, permitting applications to trade off space for time. It attempts to handle ill-formed sequences gracefully.

If you want a POSIX UTF-8 matcher, there's work that could be finished on rx-posix (which uses the same DFA engine). However, the amount and difficulty of the necessary work are probably more than you would want to take on casually.

Unfortunately, the server from which rx-xml is distributed is, at the moment, quite dead. I can perhaps provide you with a copy of the distribution via (outbound) FTP.

-t

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
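
For readers wondering what "matching data spread over multiple buffers" involves, here is a minimal C sketch. It is not the rx-xml API, which this message does not show; every name in it (`u8_scan`, `u8_scan_init`, `u8_scan_feed`) is hypothetical. The only idea it illustrates is that the matcher's state lives in a struct that outlives each call, so a UTF-8 sequence split between two network bursts is still decoded as one character. As a stand-in for a real regex, it simply counts code points in the Cyrillic block U+0400..U+04FF and tallies ill-formed sequences instead of aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scanner state; it persists between buffers. */
struct u8_scan {
    uint32_t cp;    /* code point being assembled */
    uint32_t min;   /* smallest value allowed for this sequence length */
    int      need;  /* continuation bytes still expected */
    size_t   hits;  /* code points in U+0400..U+04FF seen so far */
    size_t   bad;   /* ill-formed sequences seen so far */
};

static void u8_scan_init(struct u8_scan *s)
{
    assert(s != NULL);
    s->cp = 0;
    s->min = 0;
    s->need = 0;
    s->hits = 0;
    s->bad = 0;
}

/* Consume one buffer. May be called repeatedly on successive chunks. */
static void u8_scan_feed(struct u8_scan *s, const unsigned char *buf, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];

        if (s->need > 0) {
            if ((b & 0xC0) == 0x80) {
                /* Valid continuation byte: add its six payload bits. */
                s->cp = (s->cp << 6) | (uint32_t)(b & 0x3F);
                if (--s->need > 0)
                    continue;
                /* Sequence complete: reject overlong forms, surrogates,
                   and values beyond U+10FFFF; otherwise test the class. */
                if (s->cp < s->min || s->cp > 0x10FFFF ||
                    (s->cp >= 0xD800 && s->cp <= 0xDFFF))
                    s->bad++;
                else if (s->cp >= 0x0400 && s->cp <= 0x04FF)
                    s->hits++;
                continue;
            }
            /* Sequence cut short: record it, then treat b as a new lead. */
            s->bad++;
            s->need = 0;
        }

        if (b < 0x80) {
            /* ASCII: well-formed, never in the Cyrillic block. */
        } else if ((b & 0xE0) == 0xC0) {
            s->cp = b & 0x1F; s->min = 0x80;    s->need = 1;
        } else if ((b & 0xF0) == 0xE0) {
            s->cp = b & 0x0F; s->min = 0x800;   s->need = 2;
        } else if ((b & 0xF8) == 0xF0) {
            s->cp = b & 0x07; s->min = 0x10000; s->need = 3;
        } else {
            s->bad++;  /* stray continuation byte or invalid lead byte */
        }
    }
}
```

A caller would run `u8_scan_init` once and then `u8_scan_feed` on each chunk as it arrives from the socket. A full regex matcher would carry a DFA state number in the same struct alongside the decoder state.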
