Dan Kogai <[EMAIL PROTECTED]> writes:

> On 2002.01.31, at 21:44, Tatsuhiko Miyagawa wrote:
> > On Thu, 31 Jan 2002 12:31:58 +0000
> > Jean-Michel Hiver <[EMAIL PROTECTED]> wrote:
> >> Any ideas?
> >
> > Try kakasi or Chasen. They can be accessed from Perl via XS
> > wrappers.
>
> Tokenizing Japanese is among the hardest problems. Even the very
> notion of a token differs among linguists. For instance:
>
>     WATASHI-wa Perl-de CGI-wo KAkimasu
>
> The upper-case parts are in kanji, the lower-case parts in hiragana.
> Take the first part, WATASHI-wa. It corresponds to the English 'I',
> but it comes in two parts: WATASHI, which means "first person", and
> "wa" (spelled "ha" but pronounced "wa"), which makes the previous
> word nominative. Now the question is whether "WATASHI-wa" is a
> single token or two.

One token if you do "bunsetsu" tokenization, two tokens if you do
lemma (dictionary-entry) tokenization.
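For what it's worth, here is a minimal sketch of what Tatsuhiko's
suggestion looks like in practice. It assumes the Text::Kakasi 1.x
XS interface (getopt_argv / do_kakasi) and EUC-JP input; check the
docs of whatever version you have installed. Note that kakasi's -w
mode gives you one particular segmentation, not a canonical one.

    #!/usr/bin/perl -w
    # Sketch: word splitting (wakachi-gaki) via the kakasi XS wrapper.
    # Assumes the Text::Kakasi 1.x interface and EUC-JP input.
    use strict;
    use Text::Kakasi;

    # -w asks kakasi for wakachi-gaki (space-separated tokens);
    # -ieuc declares the input encoding.
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');

    my $euc_line = <STDIN>;                    # a sentence in EUC-JP
    print Text::Kakasi::do_kakasi($euc_line);  # tokens joined by spaces

    Text::Kakasi::close_kanwadict();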
> So please note there is no silver bullet for Japanese tokenization.
> Kakasi / Chasen are good enough for search engines like Namazu, but
> that does not mean the tokens they emit are canonical. As you see,
> there is no "canonical Japanese" in the sense of the "canonical
> French" of the Académie française :)

Unfortunately, Dan is right here. The meaning of the word "word"
alone can get you into week-long discussions with Japanese linguists.

> There is an even more radical approach when it comes to search
> engines: you can now search an arbitrary byte stream WITHOUT any
> tokenization at all, using a data structure called a suffix array.
> The concept is deceptively simple, but for some reason it was not
> discovered until the 1990s. To get an idea of what a suffix array
> is, search for 'suffix array' on Google. Interestingly, the first
> hit goes to sary.namazu.org.

If you want to try suffix arrays, "sufary"
(http://cl.aist-nara.ac.jp/lab/nlt/ss/) is worth a try. An XS module
for Perl is included in the distribution.
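To make the suffix-array idea concrete, here is a naive pure-Perl
sketch (my own illustration, not how sary/sufary actually build the
array): sort every suffix offset of the raw byte string, then
binary-search for a pattern. Construction here is O(n^2 log n), fine
for a demo only; real implementations are much smarter, but the
search side works the same way.

    #!/usr/bin/perl -w
    # Sketch: a naive suffix array over a raw byte string -- no
    # tokenization anywhere.  Demo-quality construction only.
    use strict;

    my $text = "WATASHI-wa Perl-de CGI-wo KAkimasu";

    # The suffix array is every suffix start offset, sorted by the
    # suffix it points at.
    my @sa = sort { substr($text, $a) cmp substr($text, $b) }
             0 .. length($text) - 1;

    # Any substring of $text is a prefix of some suffix, so a binary
    # search over @sa finds it.
    sub sa_find {
        my ($pat) = @_;
        my ($lo, $hi) = (0, $#sa);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $cmp = substr($text, $sa[$mid], length($pat)) cmp $pat;
            if    ($cmp < 0) { $lo = $mid + 1 }
            elsif ($cmp > 0) { $hi = $mid - 1 }
            else             { return $sa[$mid] }  # byte offset of a match
        }
        return -1;
    }

    printf "'Perl' found at byte offset %d\n", sa_find('Perl');

Since this works on raw bytes it searches EUC-JP or Shift_JIS text
just as happily; the one caveat is that a byte-level match can start
in the middle of a multibyte character, which real tools have to
deal with.

Andreas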