It seems my first attempt to send this didn't go through. Apologies if it shows up twice.


Dear list,

To tell you the truth this issue has not been raised since I am here, so I
never estimated the required efforts. Is there any example implementation for
it (for UTF caseless compare)?

A bit of background first:

I'm a long-term user of SQLite (a small but powerful embedded DB engine), which by default doesn't come with Unicode folding, non "lower ASCII" collation and such. You can use the ICU library, since SQLite has built-in hooks for it as an extension, but ICU is a huge (circa 18 MB) and slow baby.

I needed to handle a number of languages in the same DB and decided to write my own extension after looking at what was freely available (open source). I used previous code as a basis, but changed most of it (including the tries) to fix many bugs and tailor it to wider needs.

So I came up with a decently small (~180 KB) extension in C which has its own Unicode tries for folding and casing. It uses the Unicode v5.1 specs, to which I added unofficial support for the German eszett and a couple of other code points.
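To give an idea of what a table-driven ("trie") fold looks like, here is a minimal sketch. It is not the code from the extension: the names (fold_simple, caseless_cmp, add_range and so on) are made up for illustration, the tables are built at run time from a handful of ranges instead of being generated from the Unicode data files, and only a tiny subset of simple case foldings (ASCII, Latin-1, basic Greek and Cyrillic) is covered. The UTF-8 decoder assumes well-formed, NUL-terminated input.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-stage table: stage1 maps a block of 256 code points to a row of
 * stage2; each row holds the delta to add to a code point to fold it.
 * Row 0 is shared by every block that has no folding at all. */
#define BLOCK_BITS 8
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define MAX_BLOCKS 8   /* this sketch only covers U+0000..U+07FF */
#define MAX_ROWS   4   /* shared empty row + 3 populated blocks  */

static uint8_t  stage1[MAX_BLOCKS];
static int16_t  stage2[MAX_ROWS][BLOCK_SIZE];
static unsigned rows_used = 1;
static int      tables_ready = 0;

/* Hypothetical helper: record "cp folds to cp + delta" for a range. */
static void add_range(uint32_t first, uint32_t last, int16_t delta)
{
    for (uint32_t cp = first; cp <= last; cp++) {
        uint32_t block = cp >> BLOCK_BITS;
        if (stage1[block] == 0)
            stage1[block] = (uint8_t)rows_used++;
        stage2[stage1[block]][cp & (BLOCK_SIZE - 1)] = delta;
    }
}

/* Illustrative subset only; real tables come from CaseFolding.txt. */
static void build_tables(void)
{
    add_range(0x0041, 0x005A, 32);  /* A-Z                          */
    add_range(0x00C0, 0x00D6, 32);  /* Latin-1 capitals ...         */
    add_range(0x00D8, 0x00DE, 32);  /* ... skipping U+00D7 (times)  */
    add_range(0x0391, 0x03A1, 32);  /* Greek capitals ...           */
    add_range(0x03A3, 0x03A9, 32);  /* ... skipping unassigned 03A2 */
    add_range(0x0400, 0x040F, 80);  /* Cyrillic U+0400..U+040F      */
    add_range(0x0410, 0x042F, 32);  /* Cyrillic U+0410..U+042F      */
    tables_ready = 1;
}

/* Fold one code point; anything outside the tables maps to itself. */
uint32_t fold_simple(uint32_t cp)
{
    if (!tables_ready)
        build_tables();
    uint32_t block = cp >> BLOCK_BITS;
    if (block >= MAX_BLOCKS)
        return cp;
    return (uint32_t)((int32_t)cp + stage2[stage1[block]][cp & (BLOCK_SIZE - 1)]);
}

/* Decode one code point from UTF-8 and advance *p. No validation. */
static uint32_t next_cp(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Caseless compare of two UTF-8 strings, one folded code point at a
 * time. Returns <0, 0 or >0 like strcmp. */
int caseless_cmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    for (;;) {
        if (*pa == 0 || *pb == 0)
            return (*pa != 0) - (*pb != 0);
        uint32_t ca = fold_simple(next_cp(&pa));
        uint32_t cb = fold_simple(next_cp(&pb));
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
}
```

With these tables, caseless_cmp() treats the UTF-8 encodings of "ÉTÉ" and "été" as equal. One-to-many foldings (such as eszett to "ss") don't fit a single delta per code point and would need a separate path, which this sketch leaves out.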

I can't vouch it will do everything perfectly, but feedback from users around the globe shows it isn't that far off the mark.

My requirements went beyond what ICU offers: for instance, ICU collation support requires that you choose a single, specific locale for a given comparison. But in the case of (say) a customer DB table, I have people from 38 countries, using spellings/letters from various languages. Choosing one locale in this context is meaningless. I simply relied on (Windows) system calls to handle locale-independent compares. I also included a fuzzy compare and a number of other functions.

There is much code in there that can be removed for use alongside PCRE, so the final result would be even smaller and much simpler.

If ever someone wants to have a look, the source (and a Windows x86 DLL build) is freely downloadable at http://dl.dropbox.com/u/26433628/unifuzz.zip

The source code includes a long comment part, mostly about how to use the SQLite extension functions it offers. I never had the need to try compiling for 64-bit OS, but I don't believe there would be significant issues doing so.

I'm currently not in a position to do much tech work, but I will be glad to help with pruning/adapting the code if needed. Compiling and testing will be much harder in my context.

Of course, the code comes without any guarantee.

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
