Motoharu Kubo wrote:
> However, there is another issue that I did not write so far. In
> Japanese and some asian language word can be split without hyphenation.
> Joining lines with space cause problem. Not joining lines can cause
> important but undetected keyword because of line break. I am
> considering this issue right now.

Perhaps runs of whitespace between two CJK characters should be removed,
prior to tokenization.
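
Something along these lines is what I have in mind. It is only a sketch: join_cjk_runs is a made-up name, not an existing SpamAssassin function, and I am assuming that Han, Hiragana and Katakana are enough to count as "CJK" and that the text has already been decoded to Perl characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): remove whitespace that
# sits between two CJK characters, so a word broken by a line break is
# rejoined. Expects a decoded character string, not raw UTF-8 bytes.
sub join_cjk_runs {
    my ($text) = @_;

    # Assumption: these three scripts are a good enough test for "CJK".
    my $cjk = qr/[\p{Han}\p{Hiragana}\p{Katakana}]/;

    # Lookbehind and lookahead leave the neighbouring characters alone,
    # so several consecutive gaps are all closed in a single pass.
    $text =~ s/(?<=$cjk)\s+(?=$cjk)//g;
    return $text;
}

# U+65E5 U+672C, a line break, then U+8A9E, followed by English text.
my $input  = "\x{65E5}\x{672C}\n\x{8A9E} spam\nmail";
my $joined = join_cjk_runs($input);

binmode(STDOUT, ':encoding(UTF-8)');
print $joined, "\n";
```

The line break inside the Japanese word goes away, while the whitespace next to Latin text is kept, so English words are still split as before.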

> The most time consuming but accurate approach would be tokenize in
> do_body_test if language is "ja" and contents-type is "text/plain"

I don't think you want to limit it to text/plain. Any sort of text/*
should be tokenized if it is in Japanese.
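
As a rough illustration of the condition I mean, assuming the language and content type are available as plain strings (wants_ja_tokenizer is a hypothetical name, not something in the current code):

```perl
use strict;
use warnings;

# Hypothetical predicate: true when the part is Japanese and any text/* type.
# $lang and $ctype are assumed inputs; the real code may obtain them differently.
sub wants_ja_tokenizer {
    my ($lang, $ctype) = @_;
    return 0 unless defined $lang && defined $ctype;

    # Match the major type only, so text/plain, text/html, etc. all qualify.
    return ($lang eq 'ja' && $ctype =~ m{^text/}i) ? 1 : 0;
}

print wants_ja_tokenizer('ja', 'text/html'), "\n";
```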

> I checked the code and found that bayes receives normalized header text
> and non-normalized body text.

This doesn't match what I see. Using your test case message, I see
get_visible_rendered_body_text_array() returning the normalized form.
The patch you included below contains most of my change, but omits the
following hunk. Perhaps the lack of that change is your problem?

@@ -385,7 +411,7 @@
     }
     else {
       $self->{rendered_type} = $self->{type};
-      $self->{rendered} = $text;
+      $self->{rendered} = $self->{visible_rendered} = $text;
     }
   }

> In addition, \xa0 is considered as whitespace but UTF-8 can contain this
> character as second or third byte. The tokenize_line cuts \200-\x240.
> I also changed these problems and bayes seems to receive normalized
> text.

The problem here is the "use bytes" pragma at the top of
Bayes.pm--you'll want to remove that. Removing it will have some
follow-on consequences--the "use bytes" pragma will probably also have
to be removed from BayesStore and the other Bayes-related modules. The
BayesStore subclasses probably will also have to be modified to become
UTF-8 aware, storing tokens in UTF-8 form.
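
To show why byte semantics are the root of the \xa0 problem, here is a small standalone sketch. It does not use any SpamAssassin code; the variable names are made up, and the final "encode at the storage boundary" step is only my assumption about how a UTF-8 aware BayesStore might behave.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+3060 (HIRAGANA LETTER DA) is encoded in UTF-8 as the bytes E3 81 A0,
# so its last byte has the same value as a Latin-1 no-break space.
my $chars = "\x{3060}";
my $bytes = encode('UTF-8', $chars);

# With "use bytes" in effect, length() counts bytes rather than characters.
my $len_under_bytes;
{
    use bytes;
    $len_under_bytes = length($chars);
}
my $len_chars = length($chars);

# Byte string: treating \xa0 as whitespace cuts the character apart.
(my $bytes_cleaned = $bytes) =~ tr/\xa0/ /;
my $damaged = ($bytes_cleaned ne $bytes) ? 1 : 0;

# Character string: it holds a single character and no \xa0 at all.
(my $chars_cleaned = $chars) =~ tr/\xa0/ /;
my $intact = ($chars_cleaned eq $chars) ? 1 : 0;

# Assumed storage step: convert back to UTF-8 bytes only when writing out.
my $stored = encode('UTF-8', $chars_cleaned);

print "bytes=$len_under_bytes chars=$len_chars damaged=$damaged intact=$intact\n";
```

The same whitespace cleanup that corrupts the byte string leaves the character string untouched, which is why I would tokenize on characters and encode to UTF-8 only at the point where tokens are stored.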