On Tue, Feb 24, 2009 at 02:17:26PM +1100, Bron Gondwana wrote:
> I'm in the process of rewriting the lib/mkchartable.c
> and lib/charset.c with the eventual goal being a more
> flexible charset conversion API that can be used to
> make sieve rules match on the decoded values, and
> other funky things.
OK - significantly more work done. It's now working correctly on my
testbed. Behaviour is identical to the old code, tested with a reasonably
large snapshot of email I had lying around: reconstruct creates an
identical cyrus.cache, and downloads etc. all work correctly.

There are some pathological error cases that might behave slightly
differently, though I got rid of the worst of those with a change to the
couple of charset .t files that had GOSUBs in them.

http://github.com/brong/cyrus-imapd/commit/78591d3a5c2f2ed5cd4d1bf935fffa073081198c

br...@launde:/extra/src/git/cmu/cyrus-imapd$ git diff origin/master | diffstat
 b/imap/.cvsignore            |     1
 b/imap/Makefile.in           |     6
 b/imap/cyr_charset.c         |   160
 b/lib/Makefile.in            |    15
 b/lib/charset.c              |  1794 +--
 b/lib/charset.h              |    22
 b/lib/charset/iso-2022-jp.t  |    33
 b/lib/charset/iso-2022-kr.t  |    12
 b/lib/charset/unidata5_1.txt | 19336 +++++++++++++++++++++++++++++++++++++++++++
 b/lib/chartable.h            |    27
 b/lib/mkchartable.pl         |   531 +
 lib/charset/unidata2.txt     |  6629 --------------
 lib/mkchartable.c            |   974 --
 13 files changed, 20866 insertions(+), 8674 deletions(-)

Yikes!

The changes to the .t files just convert the ESC tables into multibyte
sequences for all valid escape codes in all mode tables, so that the
"invalid escape code" path drops you back into the current mode again.

cyr_charset is just a little tool that lets you see what input in a
particular charset produces as output.

unidata5_1.txt accounts for most of the diffstat, simply because it's
huge. I've made the code able to support the latest Unicode standard,
including 24-bit codepoints.

mkchartable is rewritten in Perl rather than C, because that was so
very, very much easier. I'd be willing to convert it back if people
really, really don't want to depend on Perl, but I'd probably *sigh*
a lot.

charset.c is pretty much totally rewritten. Git tells me it's 70%
changed, and actually logs it as a "rewrite" when committing. Major,
major changes to how just about everything works. All "translations"
are chainable, so you write code like this:

    struct convert_rock *translate = qp_init();
    struct convert_rock *decode = table_init(charset);
    struct convert_rock *canon = canon_init();
    struct convert_rock *toutf8 = uni_init();
    struct convert_rock *tobuffer = buffer_init(0, 0);

    translate->next = decode;
    decode->next = canon;
    canon->next = toutf8;
    toutf8->next = tobuffer;

    convert_cat(translate, s);

    res = buffer_cstring(tobuffer);

    basic_free(translate);
    basic_free(decode);
    basic_free(canon);
    basic_free(toutf8);
    buffer_free(tobuffer);

And you have a freshly malloced cstring in "res". It's annoying that
you have to alloc and free in so many lines (a possible helper for that
is sketched at the end of this mail), but otherwise the API is simple
to use, and easy to mix-and-match as required.

Each layer has a "state" object, and gets called with a single
character. It changes its state as required, and possibly calls
convert_putc on its "next" pointer with translated characters. (There's
a simplified sketch of what a layer looks like below.)

Overall it's a code saving, and it makes doing things like converting
to UTF-8 rather than search form as easy as removing a translation
layer. On the downside, it does cost a little more CPU for all the
extra function calls, as opposed to the direct A => B translation
tables of the old way. Running a full squatter on my mailboxes cost
about 10% more CPU this way. I think that's justified by the
flexibility, the full Unicode handling, and all that jazz.

I still need to document how this stuff works, in particular how the
almost-stateless search consumer works! I've explained it on paper to
Rob and Richard to make sure it makes sense to them...
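To give a feel for the shape of a layer, here's a from-memory
simplification - it's not the exact code in the commit, the field names
may differ, and uppercase_init() is invented purely as an example:

    /*
     * Sketch only: a rock is a callback, some private state, and a
     * pointer to the next consumer in the chain.
     */
    #include <ctype.h>
    #include <stdlib.h>

    struct convert_rock {
        void (*f)(struct convert_rock *rock, int c); /* consume one char */
        void *state;                  /* per-layer private state */
        struct convert_rock *next;    /* downstream consumer */
    };

    /* hand one character to a layer */
    static void convert_putc(struct convert_rock *rock, int c)
    {
        rock->f(rock, c);
    }

    /* example layer: uppercase everything, needs no state at all */
    static void uppercase_cb(struct convert_rock *rock, int c)
    {
        convert_putc(rock->next, toupper(c));
    }

    static struct convert_rock *uppercase_init(void)
    {
        struct convert_rock *rock = calloc(1, sizeof(*rock));
        rock->f = uppercase_cb;
        return rock;
    }

A stateful layer (say the ISO-2022 mode switching) would stash its mode
in "state" and only emit downstream once it has seen a full sequence.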
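And since the alloc/free boilerplate bugs me: a varargs builder along
these lines could collapse the ->next wiring. Purely hypothetical, not
in the commit - chain() is an invented name:

    #include <stdarg.h>
    #include <stddef.h>

    /* wire up a NULL-terminated list of rocks into a chain */
    static struct convert_rock *chain(struct convert_rock *first, ...)
    {
        va_list ap;
        struct convert_rock *prev = first, *cur;

        va_start(ap, first);
        while ((cur = va_arg(ap, struct convert_rock *)) != NULL) {
            prev->next = cur;
            prev = cur;
        }
        va_end(ap);

        return first;
    }

Which would turn the setup above into:

    struct convert_rock *translate =
        chain(qp_init(), table_init(charset), canon_init(),
              uni_init(), buffer_init(0, 0), NULL);

You'd still have to free each rock individually (the buffer layer has
its own buffer_free), so it only halves the boilerplate.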
Comments and code review greatly appreciated! I'll be doing some more
testing here, and then possibly pushing it to production on one of our
machines for a full smoketest!

Bron.
