Re: [Dovecot] pigeonhole, regex, UTF-8

2010-07-14 Thread Perry E. Metzger
On Tue, 13 Jul 2010 18:16:58 +0200 Stephan Bosch
step...@rename-it.nl wrote:
> As a matter of fact, I haven't looked at TRE before. I'm quite
> interested though, since it is backwards compatible with POSIX and
> seems to be available in most systems. I'll give it a closer look,
> also in terms of compatibility with the latest draft of the Sieve
> regex extension specification.

TRE has another significant advantage: for most regexes, its matching
algorithms scale linearly, whereas the algorithms that
Spencer-descended regex libraries often use can scale exponentially.
The difference in performance can be quite remarkable.

-- 
Perry E. Metzger  pe...@piermont.com


[Dovecot] pigeonhole, regex, UTF-8

2010-07-13 Thread Trever L. Adams

 Hello,

I am just learning about pigeonhole and thinking of using it. I see that
regex doesn't support UTF-8. Any particular reason for this?


If it is a library problem, have you looked at TRE? I am
using it in a project (I am using it in wchar_t mode because elsewhere
all data is converted to wchar_t). It does work with UTF-8.


Thanks,
Trever


Re: [Dovecot] pigeonhole, regex, UTF-8

2010-07-13 Thread Stephan Bosch

Trever L. Adams wrote:
> Hello,
>
> I am just learning about pigeonhole and thinking of using it. I see
> that regex doesn't support UTF-8. Any particular reason for this?
The standard regexp library does not support unicode and I was not 
planning to write my own regexp compiler any time soon.
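
To illustrate the limitation, here is a minimal sketch. It assumes "the standard regexp library" means the POSIX regcomp()/regexec() interface and that the process runs in the default "C" locale. The helper name posix_match is hypothetical. In that locale the matcher works on bytes, so "." consumes one byte rather than one UTF-8 character.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches `text`,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int posix_match(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    regex_t re;
    /* REG_NOSUB: we only want a yes/no answer, no submatch offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With the UTF-8 encoding of "é" (the two bytes 0xC3 0xA9) as input, the pattern "^.$" would not be expected to match in the "C" locale, while "^..$" would, because each "." consumes a single byte.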


> If it is a library problem, have you looked at TRE? I
> am using it in a project (I am using it in wchar_t mode because
> elsewhere all data is converted to wchar_t). It does work with UTF-8.
As a matter of fact, I haven't looked at TRE before. I'm quite 
interested though, since it is backwards compatible with POSIX and seems 
to be available in most systems. I'll give it a closer look, also in 
terms of compatibility with the latest draft of the Sieve regex 
extension specification.


Regards,

Stephan.



Re: [Dovecot] pigeonhole, regex, UTF-8

2010-07-13 Thread Trever L. Adams

On 07/13/2010 10:16 AM, Stephan Bosch wrote:
> The standard regexp library does not support unicode and I was not
> planning to write my own regexp compiler any time soon.

I wouldn't want to write one either.

> As a matter of fact, I haven't looked at TRE before. I'm quite
> interested though, since it is backwards compatible with POSIX and
> seems to be available in most systems. I'll give it a closer look,
> also in terms of compatibility with the latest draft of the Sieve
> regex extension specification.
>
> Regards,
>
> Stephan.



There are a few odd things about the wide character support in TRE. 
Either you need to convert each message to wchar_t and make sure you set 
the system encoding to wchar_t, or you need to set the system encoding 
for each message, which may or may not mess up your UTF-8 regex.


My project is an Internet Classifier (used with things like Squid proxy 
to make a filter). I convert everything to wchar_t (using iconv with 
info gathered from headers) and use the wide character versions of the 
functions. That way I know everything is just fine. I then have the 
program set the system encoding (at least the environment variable for 
the given session) to UTF-8 before I do any of the regex compiling. 
Everything works wonderfully and quite quickly.
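
The conversion step described above could look roughly like the following sketch. It is not the classifier's actual code. The function name to_wide is hypothetical, and the "WCHAR_T" target name is an assumption that holds for glibc and GNU libiconv but is not guaranteed by POSIX. The source charset would come from the message headers.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL-terminated string `in`, encoded in
 * charset `from`, into `out` (capacity `out_cap` wide characters, including
 * the terminator). Returns the number of wide characters written, or
 * (size_t)-1 on failure (unknown charset, invalid input, buffer too small). */
size_t to_wide(const char *from, const char *in, wchar_t *out, size_t out_cap)
{
    assert(from != NULL && in != NULL && out != NULL && out_cap > 0);

    /* "WCHAR_T" is an assumption: glibc and GNU libiconv accept it. */
    iconv_t cd = iconv_open("WCHAR_T", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;            /* iconv() takes non-const pointers */
    size_t src_left = strlen(in);
    char *dst = (char *)out;
    size_t dst_left = (out_cap - 1) * sizeof(wchar_t);  /* keep room for L'\0' */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    size_t n = (out_cap - 1) - dst_left / sizeof(wchar_t);
    out[n] = L'\0';
    return n;
}
```

The resulting wchar_t buffer is what would then be handed to the wide-character versions of the regex functions, so the matcher never has to guess the encoding of an individual message.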


I am not sure TRE is available on all systems where Dovecot is designed
to be compiled. I know it is for most, if not all, Unix-like systems. I
use it on Fedora.


Anyway, thank you for your work on pigeonhole.

Trever