Re: Charset normalization issue (report, patch, and request)

John Myers Tue, 10 Jan 2006 15:00:22 -0800

Motoharu Kubo wrote:

> However, there is another issue that I have not written about so far.
> In Japanese and some other Asian languages, a word can be split across
> lines without hyphenation. Joining lines with a space causes problems.
> Not joining lines can leave an important keyword undetected because of
> the line break. I am considering this issue right now.

Perhaps runs of whitespace between two CJK characters should be removed,
prior to tokenization.
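
A minimal sketch of that idea, assuming the text has already been decoded to Perl character strings. The helper name strip_cjk_whitespace is hypothetical and not part of SpamAssassin, and the choice of Han, Hiragana and Katakana as "CJK" is an assumption (Hangul is left out on purpose, since Korean separates words with spaces).

```perl
use strict;
use warnings;

# Hypothetical helper, not a SpamAssassin function.
# Removes runs of whitespace (including line breaks) that sit directly
# between two Han/Hiragana/Katakana characters. Expects decoded
# character data, not raw UTF-8 bytes.
sub strip_cjk_whitespace {
    my ($text) = @_;
    $text =~ s/(?<=[\p{Han}\p{Hiragana}\p{Katakana}])\s+(?=[\p{Han}\p{Hiragana}\p{Katakana}])//g;
    return $text;
}

# A word broken across two lines, followed by ordinary English text.
my $in  = "\x{65E5}\x{672C}\n\x{8A9E} and English words";
my $out = strip_cjk_whitespace($in);
# $out is now "\x{65E5}\x{672C}\x{8A9E} and English words"
```

Whitespace next to a non-CJK character is left alone, so English words in the same message keep their separators.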


> The most time-consuming but accurate approach would be to tokenize in
> do_body_test if the language is "ja" and the content type is
> "text/plain".

I don't think you want to limit it to text/plain. Any sort of text/*
should be tokenized if it is in Japanese.
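
Roughly, the check could look like the sketch below. The helper name wants_ja_tokenizer and its two arguments are hypothetical; how the content type and language are actually obtained inside SpamAssassin is not shown here.

```perl
use strict;
use warnings;

# Hypothetical predicate, not a SpamAssassin function.
# Returns 1 for any text/* part in Japanese, 0 otherwise.
sub wants_ja_tokenizer {
    my ($content_type, $lang) = @_;
    return 0 unless defined $content_type && defined $lang;
    return ($lang eq 'ja' && $content_type =~ m{^text/}i) ? 1 : 0;
}
```

With this, text/html and text/enriched parts are treated the same way as text/plain.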

> I checked the code and found that bayes receives normalized header text
> and non-normalized body text.

This doesn't match what I see. Using your test case message, I show
get_visible_rendered_body_text_array() returning the normalized form.

The patch you include below includes most of my change, but omits the
following hunk. Perhaps the lack of that change is your problem?

@@ -385,7 +411,7 @@
     }
     else {
       $self->{rendered_type} = $self->{type};
-      $self->{rendered} = $text;
+      $self->{rendered} = $self->{visible_rendered} = $text;
     }
   }


> In addition, \xa0 is treated as whitespace, but UTF-8 can contain this
> byte as the second or third byte of a character. tokenize_line cuts
> \200-\x240. I also fixed these problems, and bayes now seems to receive
> normalized text.

The problem here is the "use bytes" pragma at the top of
Bayes.pm--you'll want to remove that. Removing it will have some
follow-on consequences--the "use bytes" pragma will probably also have
to be removed from BayesStore and the other Bayes-related modules. The
BayesStore subclasses probably will also have to be modified to become
UTF-8 aware, storing tokens in UTF-8 form.
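
To illustrate why byte semantics hurt here, a small standalone sketch (not SpamAssassin code; the variable names are made up for the example). U+3060 HIRAGANA LETTER DA encodes to the bytes E3 81 A0 in UTF-8, so a byte-level replacement of \xa0 damages it, while the same replacement on a character string leaves it alone.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+3060 HIRAGANA LETTER DA.
my $chars = "\x{3060}";
# Its UTF-8 encoding as a byte string: E3 81 A0.
my $bytes = encode('UTF-8', $chars);

# Byte view: the trailing 0xA0 byte is replaced, corrupting the character.
(my $byte_cleaned = $bytes) =~ tr/\xA0/ /;

# Character view: the string holds no U+00A0, so nothing changes.
(my $char_cleaned = $chars) =~ tr/\xA0/ /;

# "use bytes" switches length() to counting bytes of the internal form.
my $char_len = length($chars);
my $byte_len = do { use bytes; length($chars) };

printf "bytes changed: %s\n", ($byte_cleaned ne $bytes ? "yes" : "no");
printf "chars changed: %s\n", ($char_cleaned ne $chars ? "yes" : "no");
printf "length without/with use bytes: %d/%d\n", $char_len, $byte_len;
```

Working on decoded characters sidesteps the problem, which is why the pragma would need to go from every module that touches the tokens.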
