2015-10-10 03:03, RW wrote:

I'm not seeing any body tokens, even after training.

I was expecting that the text would be tokenized as individual UTF-8
sequences. ASCII characters encoded as UTF-16 and decoded with the
wrong endianness are still valid UTF-16. Normalizing them into
UTF-8 should produce entirely multi-byte UTF-8 with no ASCII whitespace
or punctuation (apart from ASCII spaces coming out as U+2000).
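As a rough illustration (a made-up Perl snippet using the core Encode
module, not anything from SA itself):

use Encode qw(encode decode);

my $ascii   = "Dear partner";
my $utf16le = encode("UTF-16LE", $ascii);    # 'D' becomes bytes 0x44 0x00

# Reading the same bytes as big-endian still yields valid code points:
# 'D' comes out as U+4400, a space as U+2000 - nothing ASCII survives.
my $swapped = decode("UTF-16BE", $utf16le);

printf "U+%04X\n", ord($_) for split //, $swapped;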

If I add John Hardin's diagnostic rule

body     __ALL_BODY     /.*/
tflags   __ALL_BODY     multiple

I get:

ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_
_p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_
_p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_
_...

It looks like it's still UTF-16, and Bayes is seeing individual
letters (which are too short to be tokens) separated by nulls.

The way it works now is that if decoding with the declared charset fails,
and some guessing fails too, it falls back to Windows-1252,
which is a single-byte encoding (a superset of ISO-8859-1)
that can't fail, and gives you the result you are seeing
(letters spaced out by null characters).
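Roughly, the logic described above amounts to something like this
(a sketch only, not the actual SpamAssassin code; the function name
is made up):

use Encode qw(decode);

sub normalize_to_utf8 {
    my ($bytes, $declared) = @_;

    # try the charset declared in the MIME headers, croaking on bad input
    my $text = eval { decode($declared, $bytes, Encode::FB_CROAK) };
    return $text if defined $text;

    # ... charset guessing would go here ...

    # last resort: Windows-1252 maps every byte, so it can never fail
    return decode("cp1252", $bytes);
}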

If I change the mime charset to utf-16le it works correctly - I get what
appears to be the multi-byte UTF-8 I was expecting - except that the
subject isn't converted, including the copy in the body.

The encoded-word in the Subject header field needs to be
declared as utf-16le too; then it works (tried on trunk).
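For reference, an encoded-word with the endianness spelled out would
look something like this (a small Perl illustration; the subject text
is just a placeholder):

use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

my $subject = "Dear partner";
my $bytes   = encode("UTF-16LE", $subject);

# RFC 2047 encoded-word naming utf-16le explicitly, base64-encoded
printf "Subject: =?utf-16le?B?%s?=\n", encode_base64($bytes, "");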

So SA isn't falling back to big-endian; it won't normalize without an
explicit endianness.

It tries as BE, and when Encode::decode reports a failure, it
decodes as Windows-1252.
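That last step is what produces the NUL-separated letters in the rule
hit above; for example (again a made-up snippet):

use Encode qw(encode decode);

my $bytes = encode("UTF-16LE", "Dear");    # bytes 44 00 65 00 61 00 72 00

# cp1252 treats every byte as a character, so the 0x00 bytes survive
my $fallback = decode("cp1252", $bytes);

print join(" ", map { sprintf "%02X", ord } split //, $fallback), "\n";
# 44 00 65 00 61 00 72 00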

BTW with normalize_charset 0 it looks like a spammer can effectively
turn off body tokenization by using UTF-16 (with correct endianness).

Yes. There are also other tricks that a spammer can play.
It's not possible to emulate all the different behaviours of
various mail-reading programs. Still, in the case at hand
it would make sense to also try utf-16le, since that is the
default endianness on Windows.
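A sketch of what that could look like, extending the earlier fallback
sketch (again not actual SpamAssassin code):

use Encode qw(decode);

sub normalize_with_le_guess {
    my ($bytes, $declared) = @_;

    # try the declared charset, then a utf-16le guess,
    # before giving up and falling back to cp1252
    for my $charset ($declared, "UTF-16LE") {
        my $text = eval { decode($charset, $bytes, Encode::FB_CROAK) };
        return $text if defined $text;
    }
    return decode("cp1252", $bytes);
}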

  Mark
