Re: Charset normalization issue (report, patch, and request)

Justin Mason Fri, 13 Jan 2006 18:21:40 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


"John Myers" writes:
> Motoharu Kubo wrote:
> >>The problem here is the "use bytes" pragma at the top of
> >>Bayes.pm--you'll want to remove that. Removing it will have some
> >>follow-on consequences--the "use bytes" pragma will probably also have
> >>to be removed from BayesStore and the other Bayes-related modules. The
> >>BayesStore subclasses probably will also have to be modified to become
> >>UTF-8 aware, storing tokens in UTF-8 form.
> >
> >I did not change it because I think speed is another important factor
> >for a mail filter.
> >
> My experience shows that speed only becomes an issue when one ends up
> using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
> I believe correctness is more important. I would have to see a
> significant measured decrease in speed before considering sacrificing
> correctness for speed.
> 
> The fact that the Bayes code confuses A0 bytes in UTF-8 encoded
> characters with the U+00A0 character is one example of an issue that
> would be solved were the "use bytes" pragma removed. To be correct, the
> Bayes database should be storing all tokens in UTF-8, so they match
> regardless of how they are encoded.
> 
> 
> I'm not yet convinced that tokenization belongs inside
> get_rendered_body_text_array() and
> get_visible_rendered_body_text_array(). I suspect the content preview,
> which uses get_rendered_body_text_array(), would look strange were it to
> be tokenized. I am using get_visible_rendered_body_text_array() for
> something which I'm not yet convinced needs tokenization. I think this
> area needs some field experience.
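
For illustration, the A0 confusion quoted above can be reproduced outside
SpamAssassin. The following is a minimal Python sketch, not the Bayes code:
the sample string, the separator set, and the helper name normalize_token
are assumptions made up for this example. It shows a byte-oriented split
cutting a UTF-8 character in half, and how converting tokens to UTF-8 makes
differently encoded copies of the same word compare equal.

```python
import re

# U+00E0 (a with grave) is encoded in UTF-8 as the bytes C3 A0, so its
# second byte has the same value as Latin-1 NO-BREAK SPACE (0xA0).
text = "voil\u00e0 tout"
raw = text.encode("utf-8")

# Byte-oriented view: treating the byte 0xA0 as a separator splits
# inside the two-byte character and leaves a dangling C3 lead byte.
byte_tokens = [t for t in re.split(rb"[\x20\xa0]+", raw) if t]

# Character-oriented view: decode first, then split on the actual
# code points U+0020 and U+00A0.
char_tokens = [t for t in re.split("[\u0020\u00a0]+", raw.decode("utf-8")) if t]


def normalize_token(raw_bytes, charset):
    """Hypothetical helper: re-encode a token from its source charset to UTF-8."""
    return raw_bytes.decode(charset).encode("utf-8")


print(byte_tokens)
print(char_tokens)
# The same word arriving as ISO-8859-1 and as UTF-8 yields one stored form.
print(normalize_token(b"voil\xe0", "iso-8859-1") == normalize_token(raw[:6], "utf-8"))
```

The byte-oriented split produces a token ending in a stray C3 byte, while
the decoded split keeps the word intact.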

I'm pretty sure it's not appropriate to put tokenization inside
Message at all.

So far, the only code that performs word tokenization is Bayes.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDyGASMJF5cimLx9ARAr0WAJ9jzD/J3uZOdtAolT0VkknRD5d9+gCdGIWz
lJJWuLmqGpoInuO4HKwYgw8=
=bX6P
-----END PGP SIGNATURE-----
