Dan Kogai <[EMAIL PROTECTED]> writes:

> On 2002.01.31, at 21:44, Tatsuhiko Miyagawa wrote:
> > On Thu, 31 Jan 2002 12:31:58 +0000
> > Jean-Michel Hiver <[EMAIL PROTECTED]> wrote:
> >> Any ideas?
> >
> > Try kakasi or Chasen. They can be accessed from Perl via XS
> > wrappers.
>
> Tokenizing Japanese is among the hardest problems. Even the very
> notion of a token differs among linguists. For instance:
>
>     WATASHI-wa Perl-de CGI-wo KAkimasu
>
> The upper-case parts are in kanji, the lower-case parts in hiragana.
> Take the first part, WATASHI-wa. It corresponds to the English 'I',
> but it comes in two parts: WATASHI, which means "first person", and
> "wa" (spelled "ha" but pronounced "wa"), which makes the previous
> word nominative. Now the question is whether "WATASHI-wa" is a
> single token or two.

One token if you do "bunsetsu" tokenization, two tokens if you do
lemma (dictionary-entry) tokenization.
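For what it's worth, here is a minimal sketch of what Tatsuhiko's
suggestion looks like in practice. It assumes the Text::Kakasi 1.x
XS interface (getopt_argv / do_kakasi) and EUC-JP input; check the
docs of whatever version you have installed. Note that kakasi's -w
mode gives you one particular segmentation, not a canonical one.

    #!/usr/bin/perl -w
    # Sketch: word splitting (wakachi-gaki) via the kakasi XS wrapper.
    # Assumes the Text::Kakasi 1.x interface and EUC-JP input.
    use strict;
    use Text::Kakasi;

    # -w asks kakasi for wakachi-gaki (space-separated tokens);
    # -ieuc declares the input encoding.
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');

    my $euc_line = <STDIN>;                    # a sentence in EUC-JP
    print Text::Kakasi::do_kakasi($euc_line);  # tokens joined by spaces

    Text::Kakasi::close_kanwadict();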
> So please note there is no silver bullet for Japanese tokenization.
> Kakasi / Chasen are good enough for search engines like Namazu, but
> that does not mean the tokens they emit are canonical. As you see,
> there is no "canonical Japanese" in the sense of the "canonical
> French" of the Académie française :)

Unfortunately, Dan is right here. The meaning of the word "word"
alone can get you into week-long discussions with Japanese linguists.

> There is an even more radical approach when it comes to search
> engines: you can now search an arbitrary byte stream WITHOUT any
> tokenization at all, using a data structure called a suffix array.
> The concept is deceptively simple, but for some reason it was not
> discovered until the 1990s. To get an idea of what a suffix array
> is, search for 'suffix array' on Google. Interestingly, the first
> hit goes to sary.namazu.org.

If you want to try suffix arrays, "sufary"
(http://cl.aist-nara.ac.jp/lab/nlt/ss/) is worth a try. An XS module
for Perl is included in the distribution.
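To make the suffix-array idea concrete, here is a naive pure-Perl
sketch (my own illustration, not how sary/sufary actually build the
array): sort every suffix offset of the raw byte string, then
binary-search for a pattern. Construction here is O(n^2 log n), fine
for a demo only; real implementations are much smarter, but the
search side works the same way.

    #!/usr/bin/perl -w
    # Sketch: a naive suffix array over a raw byte string -- no
    # tokenization anywhere.  Demo-quality construction only.
    use strict;

    my $text = "WATASHI-wa Perl-de CGI-wo KAkimasu";

    # The suffix array is every suffix start offset, sorted by the
    # suffix it points at.
    my @sa = sort { substr($text, $a) cmp substr($text, $b) }
             0 .. length($text) - 1;

    # Any substring of $text is a prefix of some suffix, so a binary
    # search over @sa finds it.
    sub sa_find {
        my ($pat) = @_;
        my ($lo, $hi) = (0, $#sa);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $cmp = substr($text, $sa[$mid], length($pat)) cmp $pat;
            if    ($cmp < 0) { $lo = $mid + 1 }
            elsif ($cmp > 0) { $hi = $mid - 1 }
            else             { return $sa[$mid] }  # byte offset of a match
        }
        return -1;
    }

    printf "'Perl' found at byte offset %d\n", sa_find('Perl');

Since this works on raw bytes it searches EUC-JP or Shift_JIS text
just as happily; the one caveat is that a byte-level match can start
in the middle of a multibyte character, which real tools have to
deal with.

Andreas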