Keith Stribley wrote:
I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) often does not have spaces between
words, so the process of segmenting into words is more difficult than
just whitespace.
I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer in org.apache.lucene.analysis, but what is needed to use this
within Nutch? Are there examples of other non-whitespace Tokenizers
being used in Nutch? I notice there is a translation for Thai, but I
couldn't find any Thai specific segmentation.

Out of the box, Nutch seems able to search space delimited Myanmar, but
it is usually unable to pick out words without space delimiters.
Presumably, I'll need to adapt the code in net.nutch.analysis, but are
there other areas that I need to look at as well? Any tips would be much
appreciated.
thanks,

It seems to me that this is a very similar problem to other languages where spaces don't matter so much, like CJK (Chinese, Japanese, Korean). There were a few threads of discussion on that; please check the archives. There is a bi-gram based tokenizer for such languages, and perhaps it would work well in your case too.

Burmese (and Thai) segmentation is a bit different from that of CJK languages. For CJK, since most words are one or two characters, you can "segment" (as far as search goes) by indexing everything as both a single character and all possible two-character combinations.
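To make the bigram approach concrete, here is a minimal sketch of that kind of "segmentation" in plain Java: every single character plus every adjacent two-character pair becomes a token. This is an illustrative helper only, not the actual Lucene/Nutch CJK tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit each character as a unigram token, and each adjacent
    // pair of characters as a bigram token.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));        // unigram
            if (i + 1 < text.length()) {
                tokens.add(text.substring(i, i + 2));    // bigram
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Any multi-word query then matches on overlapping bigrams.
        System.out.println(bigrams("abcd"));
        // → [a, ab, b, bc, c, cd, d]
    }
}
```

Because real words are reassembled at query time from overlapping bigrams, this works tolerably when words are one or two characters long, which is exactly why it breaks down for longer Burmese/Thai words, as noted below.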

For Burmese and Thai, given the lengths of words, this would result in a huge explosion in the number of terms being indexed. I'd also be worried about false positive problems. But it might be worth a try, since true word segmentation is a much harder problem. For Thai, the approach I've seen used is to generate a dictionary of the 5-10K most common words, then use a greedy (longest match) parse algorithm, along with a few simple heuristics to force break positions at key character boundaries. This works reasonably well, but improving the accuracy further requires adding grammatical info to the word dictionary, a table of bigram grammar pair probabilities (e.g. noun followed by verb), etc.
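The greedy longest-match idea above can be sketched in a few lines of Java. This is an assumption-laden illustration: the tiny dictionary and Latin-letter input stand in for a real 5-10K word Thai/Burmese dictionary, and the heuristic break-position rules mentioned above are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedySegmenter {
    private final Set<String> dict;
    private final int maxLen;   // longest word in the dictionary

    public GreedySegmenter(Set<String> dict) {
        this.dict = dict;
        int m = 1;
        for (String w : dict) m = Math.max(m, w.length());
        this.maxLen = m;
    }

    // At each position, take the longest dictionary word that matches;
    // fall back to a single character when nothing matches.
    public List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = text.substring(i, i + 1);  // fallback: one char
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("the", "them", "there", "cat"));
        System.out.println(new GreedySegmenter(dict).segment("thereisacat"));
        // → [there, i, s, a, cat]
    }
}
```

Note the classic failure mode of pure greedy matching: it commits to the longest word even when a shorter choice would let the rest of the sentence parse, which is where the grammatical/bigram-probability refinements mentioned above come in.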

If you search online, I'm guessing you'll find at least some research papers written on this. Though maybe you've written such a paper, all of the above is old hat to you, and the question is simply how to plug your code into Nutch. In that case, yes, look at the Chinese tokenizer as a starting point.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers