Keith Stribley wrote:
I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) often does not have spaces between
words, so the process of segmenting into words is more difficult than
just whitespace.
I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer in org.apache.lucene.analysis, but what is needed to use this
within Nutch? Are there examples of other non-whitespace Tokenizers
being used in Nutch? I notice there is a translation for Thai, but I
couldn't find any Thai specific segmentation.

Out of the box, Nutch seems able to search space delimited Myanmar, but
it is usually unable to pick out words without space delimiters.
Presumably, I'll need to adapt the code in net.nutch.analysis, but are
there other areas that I need to look at as well? Any tips would be much
appreciated.
thanks,

It seems to me that this is a very similar problem to other languages where spaces don't matter so much, like CJK (Chinese, Japanese, Korean). There were a few threads of discussion on that; please check the archives. There is a bi-gram based tokenizer for such languages, and perhaps it would work well in your case too.

Burmese (and Thai) segmentation is a bit different from that of CJK languages. For CJK, since most words are one or two characters, you can "segment" (as far as search goes) by indexing everything as both a single character and all possible two-character combinations.
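To make the bigram approach concrete, here is a minimal sketch of that kind of "segmentation" in plain Java: every single character plus every adjacent two-character pair becomes a token. This is an illustrative helper only, not the actual Lucene/Nutch CJK tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit each character as a unigram token, and each adjacent
    // pair of characters as a bigram token.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));        // unigram
            if (i + 1 < text.length()) {
                tokens.add(text.substring(i, i + 2));    // bigram
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Any multi-word query then matches on overlapping bigrams.
        System.out.println(bigrams("abcd"));
        // → [a, ab, b, bc, c, cd, d]
    }
}
```

Because real words are reassembled at query time from overlapping bigrams, this works tolerably when words are one or two characters long, which is exactly why it breaks down for longer Burmese/Thai words, as noted below.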

For Burmese and Thai, given the lengths of words, this would result in a huge explosion in the number of terms being indexed. I'd also be worried about false positive problems. But it might be worth a try, since true word segmentation is a much harder problem. For Thai, the approach I've seen used is to generate a dictionary of the 5-10K most common words, then use a greedy (longest match) parse algorithm, along with a few simple heuristics to force break positions at key character boundaries. This works reasonably well, but improving the accuracy further requires adding grammatical info to the word dictionary, a table of bigram grammar pair probabilities (e.g. noun followed by verb), etc.
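The greedy longest-match idea above can be sketched in a few lines of Java. This is an assumption-laden illustration: the tiny dictionary and Latin-letter input stand in for a real 5-10K word Thai/Burmese dictionary, and the heuristic break-position rules mentioned above are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedySegmenter {
    private final Set<String> dict;
    private final int maxLen;   // longest word in the dictionary

    public GreedySegmenter(Set<String> dict) {
        this.dict = dict;
        int m = 1;
        for (String w : dict) m = Math.max(m, w.length());
        this.maxLen = m;
    }

    // At each position, take the longest dictionary word that matches;
    // fall back to a single character when nothing matches.
    public List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = text.substring(i, i + 1);  // fallback: one char
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("the", "them", "there", "cat"));
        System.out.println(new GreedySegmenter(dict).segment("thereisacat"));
        // → [there, i, s, a, cat]
    }
}
```

Note the classic failure mode of pure greedy matching: it commits to the longest word even when a shorter choice would let the rest of the sentence parse, which is where the grammatical/bigram-probability refinements mentioned above come in.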

If you search online, I'm guessing you'll find at least some research papers written on this. Though maybe you've written such a paper, all of the above is old hat to you, and the question is simply how to plug your code into Nutch. In that case, yes, look at the Chinese tokenizer as a starting point.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers