Re: Charset normalization issue (report, patch, and request)

Motoharu Kubo Wed, 11 Jan 2006 16:38:33 -0800

>>o Made a splitter function to separate the Kakasi processing, and moved
>>  the routine from Message/Node.pm to Message.pm.  This makes it easier
>>  to replace Kakasi with another program.  The function simply returns
>>  if the text contains no UTF-8 data, so the performance loss is
>>  minimal for single-byte charsets.
>>
>>  splitter is called from:
>>     get_rendered_body_text_array()
>>     get_visible_rendered_body_text_array()
> 
> 
> Would it be possible to move this to Bayes.pm?   As noted, it's
> Bayes-specific, and this is a more appropriate place.


As you and John suggest, I first thought of moving this to Bayes.pm, but
I wanted to keep it in the Message routines.

It is difficult to describe Japanese matters in English :) but I will
try to explain the reason.

There is a famous phrase "sumomomomomomomomonouchi" (in hiragana) in
Japanese.  It is usually written without spaces, but it is actually
composed of several words: "sumomo mo momo mo momo no uchi".

Suppose "momomomomomo" is the word we want to detect.

(a) If word splitting is omitted, "sumomomomomomomomonouchi" matches.

(b) If the text is "sumomomomo\nmomomomonouchi" ("\n" is a line break,
of course), it doesn't match.

In this example, the given phrase should not match.  We should avoid
case (a), which means word splitting is necessary.  In case (b) we
happen to get the correct result, but the result can vary depending on
where the "\n" falls.

I define "word splitting" as (1) joining the fragments of a word split
by a line break, and (2) inserting spaces between words.  I think both
are necessary for header and body checks (word matching) and for Bayes.
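
The three situations above can be checked with a few lines of Python.
This is only an illustration of the argument, not SpamAssassin code: the
names matches() and rewrap() are made up for the example, and the
segmented string is written by hand to stand in for what a word splitter
would return.

```python
TARGET = "momomomomomo"

unsplit = "sumomomomomomomomonouchi"
wrapped = "sumomomomo\nmomomomonouchi"
# Assumed output of a word splitter (hand-written here, not computed).
segmented = "sumomo mo momo mo momo no uchi"


def matches(text, word=TARGET):
    """Plain substring search, standing in for a word-match rule."""
    return word in text


def rewrap(text, column):
    """Put a single line break at the given column of the unwrapped text."""
    flat = text.replace("\n", "")
    return flat[:column] + "\n" + flat[column:]


print(matches(unsplit))             # case (a): True, a false match
print(matches(wrapped))             # case (b): False, but only by luck
print(matches(rewrap(wrapped, 4)))  # same phrase, other wrap point: True
print(matches(segmented))           # after word splitting: False
```

The third call shows why case (b) cannot be relied on: moving the line
break changes the answer, while the segmented text gives the same result
wherever the original was wrapped.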

I think there are languages other than Japanese that need tokenization
(word splitting) based on a dictionary or language-specific logic.  The
splitter() can be enhanced by the SA developer team or by users.
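
As a rough sketch of how such a replaceable splitter could be shaped,
the Python below takes the segmenter as a parameter and returns early
when the text has no high-bit bytes.  Everything here is an assumption
made for illustration: the function names, the "segment" parameter, and
the toy join_wrapped() segmenter (which only joins wrapped lines and
inserts no spaces) are not taken from the patch or from Kakasi.

```python
import re


def has_high_bit(data: bytes) -> bool:
    """True if any byte is outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


def join_wrapped(text: str) -> str:
    """Toy segmenter: drop line breaks that sit between two non-ASCII
    characters.  A real segmenter would also insert spaces between words."""
    return re.sub(r"(?<=[^\x00-\x7f])\n(?=[^\x00-\x7f])", "", text)


def splitter(data: bytes, segment=join_wrapped) -> bytes:
    """Run a pluggable segmenter over UTF-8 text, leaving other input alone."""
    if not has_high_bit(data):
        return data              # fast path for pure ASCII text
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return data              # not valid UTF-8: leave untouched
    return segment(text).encode("utf-8")


print(splitter(b"plain\ntext"))
print(splitter("すもも\nもも".encode("utf-8")).decode("utf-8"))
```

Passing a different callable as "segment" is the point of the design:
the caller does not need to know which program does the segmentation.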

>>o Bayes tokenization for long tokens.  The original code cuts every two
>>  bytes from the start of the token.  As a Japanese character in UTF-8
>>  takes at least 3 bytes, I changed it to cut at every UTF-8 character.
>>
>>  I am not sure whether this change is appropriate.
> 
> 
> It may be better to entirely disable the feature that cuts 8-bit strings
> into 2-byte pairs, if Kakasi is in use, since it was intended as a
> low-cost way of generating approximate-tokenized word tokens for Asian
> character sets, and Kakasi does that task more effectively.

It is good to hear that we can disable it.  Omitting this feature will
decrease "noise" and database size.

I will disable this routine if the text contains a UTF-8 character
sequence.
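
To make the difference concrete, here is a small Python sketch of the
three behaviours discussed: cutting into 2-byte pairs, cutting at UTF-8
character boundaries, and skipping the cut when the token is UTF-8.  It
is not the Bayes.pm code; the function names and the exact conditions
are assumptions for illustration only.

```python
def byte_pairs(token: bytes):
    """Cut a token into consecutive 2-byte pieces."""
    return [token[i:i + 2] for i in range(0, len(token), 2)]


def utf8_chars(token: bytes):
    """Cut a valid UTF-8 token into one piece per character."""
    return [ch.encode("utf-8") for ch in token.decode("utf-8")]


def is_multibyte_utf8(token: bytes) -> bool:
    """True if the token is valid UTF-8 with at least one non-ASCII character."""
    try:
        text = token.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(ord(c) > 0x7F for c in text)


def sub_tokens(token: bytes):
    """Assumed policy: no extra pieces for ASCII or UTF-8 tokens,
    2-byte pairs for other 8-bit tokens."""
    if all(b < 0x80 for b in token) or is_multibyte_utf8(token):
        return []
    return byte_pairs(token)


momo = "もも".encode("utf-8")           # 2 characters, 6 bytes
print(byte_pairs(momo))                 # 3 pieces, none a whole character
print(utf8_chars(momo))                 # 2 pieces of 3 bytes each
print(sub_tokens(momo))                 # [] because the token is UTF-8
print(sub_tokens(b"\xa4\xe2\xa4\xe2"))  # non-UTF-8 8-bit bytes: 2 pairs
```

The first print shows the problem with 2-byte cuts on UTF-8 text: each
piece straddles a character boundary, so the pieces carry no meaning.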

-- 
Motoharu Kubo
[EMAIL PROTECTED]
