From: David Starner <[EMAIL PROTECTED]>

   The dict standard dictates that all data crossing the wire shall be in
   UTF-8. Unfortunately, the reference implementation doesn't even try to
   get it right. I was discussing the issue with a maintainer of a Russian
   dictionary for dict, and part of the problem was that there was no UTF-8
   regex engine. Does anyone know of a UTF-8 regex engine, preferably one
   that can be plugged into a GPL'ed C program easily?



What flavor of "regex"?  The regular expression language specified for
XML Schema has the advantage that it permits very fast
implementations.  One such implementation is rx-xml, a GPL'ed
DFA-based matcher that handles UTF-8 and all byte-order variations on
UTF-16 (i.e., be, le, and native).

I haven't looked at the source code for the system you're working on
or the protocol specs.  Rx-xml has features that might or might not be
especially useful, depending on your needs.  For example, it can match
data spread over multiple buffers (as in successive bursts from a
network connection).  It has support for interrupting long-running
matches.  It is highly tunable, permitting applications to trade off
space for time.  It attempts to handle ill-formed sequences
gracefully.
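
To make the multi-buffer point concrete, here is a minimal sketch in C of
a matcher that keeps its decoder and DFA state between calls, so a UTF-8
character may be split across two network reads.  This is NOT the rx-xml
API: the names (stream_matcher, sm_init, sm_feed, sm_finish) and the
hard-coded pattern, one or more characters in U+0400..U+04FF, are
assumptions chosen purely for illustration.  Ill-formed input is mapped
to U+FFFD, which is one possible reading of "gracefully", not
necessarily the one rx-xml uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming matcher for illustration; not the rx-xml API. */
enum { ST_START, ST_ACCEPT, ST_DEAD };

typedef struct {
    int      dfa_state; /* current DFA state */
    uint32_t cp;        /* code point being assembled */
    uint32_t min_cp;    /* smallest legal value for this sequence length */
    int      need;      /* continuation bytes still expected */
} stream_matcher;

void sm_init(stream_matcher *m)
{
    m->dfa_state = ST_START;
    m->cp = 0;
    m->min_cp = 0;
    m->need = 0;
}

/* One DFA step for the assumed pattern [U+0400-U+04FF]+ (whole input). */
static void sm_step(stream_matcher *m, uint32_t cp)
{
    if (m->dfa_state == ST_DEAD)
        return;
    m->dfa_state = (cp >= 0x0400 && cp <= 0x04FF) ? ST_ACCEPT : ST_DEAD;
}

/* Feed one buffer.  All state lives in *m, so a character may start in
   one buffer and end in the next. */
void sm_feed(stream_matcher *m, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];

        if (m->need > 0) {
            if ((b & 0xC0) == 0x80) {
                m->cp = (m->cp << 6) | (b & 0x3Fu);
                if (--m->need == 0) {
                    /* Reject overlong forms, surrogates, and > U+10FFFF. */
                    int bad = m->cp < m->min_cp || m->cp > 0x10FFFF ||
                              (m->cp >= 0xD800 && m->cp <= 0xDFFF);
                    sm_step(m, bad ? 0xFFFD : m->cp);
                }
                continue;
            }
            /* Sequence cut short: emit U+FFFD, then treat b as a new lead. */
            m->need = 0;
            sm_step(m, 0xFFFD);
        }

        if (b < 0x80) {
            sm_step(m, b);
        } else if ((b & 0xE0) == 0xC0) {
            m->cp = b & 0x1Fu; m->min_cp = 0x80;    m->need = 1;
        } else if ((b & 0xF0) == 0xE0) {
            m->cp = b & 0x0Fu; m->min_cp = 0x800;   m->need = 2;
        } else if ((b & 0xF8) == 0xF0) {
            m->cp = b & 0x07u; m->min_cp = 0x10000; m->need = 3;
        } else {
            sm_step(m, 0xFFFD); /* stray continuation or invalid lead byte */
        }
    }
}

/* Call at end of input; returns 1 if the whole input matched. */
int sm_finish(stream_matcher *m)
{
    if (m->need > 0) {          /* input ended in mid-character */
        m->need = 0;
        sm_step(m, 0xFFFD);
    }
    return m->dfa_state == ST_ACCEPT;
}
```

A caller would run sm_init once, call sm_feed for each chunk as it
arrives from the socket, and call sm_finish at end of input.  A real
engine would compile the DFA from the pattern instead of hard-coding
the transition as this sketch does.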

If you want a POSIX UTF-8 matcher, there's unfinished work on rx-posix
(which uses the same DFA engine) that could be completed.  However,
the amount and difficulty of the remaining work are probably more than
you would want to take on casually.

Unfortunately, the server from which rx-xml is distributed is, at the
moment, quite dead.  I can perhaps provide you with a copy of the
distribution via (outbound) FTP.

-t


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
