-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


"John Myers" writes:
> Motoharu Kubo wrote:
> >>The problem here is the "use bytes" pragma at the top of
> >>Bayes.pm--you'll want to remove that. Removing it will have some
> >>follow-on consequences--the "use bytes" pragma will probably also have
> >>to be removed from BayesStore and the other Bayes-related modules. The
> >>BayesStore subclasses probably will also have to be modified to become
> >>UTF-8 aware, storing tokens in UTF-8 form.
> >
> >I did not change because I think speed is another important factor for
> >mail filter.
> >
> My experience shows that speed only becomes an issue when one ends up
> using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
> I believe correctness is more important. I would have to see a
> significant measured decrease in speed before considering sacrificing
> correctness for speed.
>
> The fact that the Bayes code confuses A0 bytes in UTF-8 encoded
> characters with the U+00A0 character is one example of an issue that
> would be solved were the "use bytes" pragma removed. To be correct, the
> Bayes database should be storing all tokens in UTF-8, so they match
> regardless of how they are encoded.
>
>
> I'm not yet convinced that tokenization belongs inside
> get_rendered_body_text_array() and
> get_visible_rendered_body_text_array(). I suspect the content preview,
> which uses get_rendered_body_text_array(), would look strange were it to
> be tokenized. I am using get_visible_rendered_body_text_array() for
> something which I'm not yet convinced needs tokenization. I think this
> area needs some field experience.

I'm pretty sure it's not appropriate to put tokenization inside Message
at all. So far, the only code that performs word tokenization is Bayes.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDyGASMJF5cimLx9ARAr0WAJ9jzD/J3uZOdtAolT0VkknRD5d9+gCdGIWz
lJJWuLmqGpoInuO4HKwYgw8=
=bX6P
-----END PGP SIGNATURE-----

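For anyone following the A0 point in the quoted text, here is a rough
illustration of the general failure mode. It is written in Python rather
than Perl, and it is not the SpamAssassin Bayes tokenizer: split_bytes()
and split_chars() are hypothetical helpers made up for this sketch. The
assumption is simply that a byte-oriented splitter treats the lone byte
0xA0 as a separator (as it would for Latin-1 NBSP), while a
character-oriented one decodes UTF-8 first and only then looks for U+00A0.

```python
import re


def split_bytes(raw):
    """Byte semantics: split on ASCII whitespace and the single byte 0xA0.

    Hypothetical helper for illustration only. It cannot tell a Latin-1
    NBSP from the trailing byte of a multi-byte UTF-8 character.
    """
    return [tok for tok in re.split(rb"[ \t\r\n\xa0]+", raw) if tok]


def split_chars(raw):
    """Character semantics: decode UTF-8 first, then split on U+00A0."""
    text = raw.decode("utf-8")
    return [tok for tok in re.split(r"[ \t\r\n\u00a0]+", text) if tok]


# "voilà tout": the character U+00E0 is encoded in UTF-8 as the bytes C3 A0.
sample = "voil\u00e0 tout".encode("utf-8")

print(split_bytes(sample))  # the first token is cut off after the C3 byte
print(split_chars(sample))  # the first token keeps its final character
```

With byte semantics the first token ends in a dangling C3 byte, so it
would not match the same word arriving in another encoding. Decoding
first keeps the word whole.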