I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) is often written without spaces
between words, so segmenting text into words takes more than splitting
on whitespace.
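
As a rough illustration of the kind of segmentation involved, here is a
minimal sketch of greedy longest-match dictionary segmentation in plain
Java. It is deliberately not tied to the Lucene Tokenizer API or to
Nutch. The class name LongestMatchSegmenter, its segment method, and the
two-word dictionary in the usage note are all hypothetical, chosen only
for illustration. Greedy matching can mis-segment ambiguous text, so
treat this as a starting point rather than a finished approach.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Lucene or Nutch. */
class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    LongestMatchSegmenter(Set<String> dictionary) {
        this.dictionary = new HashSet<>(dictionary);
        // Remember the longest dictionary entry so lookups can be bounded.
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Splits text into dictionary words, preferring the longest match. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Whitespace still separates tokens when it is present.
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
                continue;
            }
            // Try the longest candidate first, then shrink it.
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1
                    && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // If nothing matched, this emits a single character as a token.
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With a dictionary holding "\u1019\u103C\u1014\u103A\u1019\u102C"
(Myanmar) and "\u1005\u102C" (writing), calling segment on the two
strings joined without a space should return them as two separate
tokens. A real implementation would need a much larger word list and
some handling for words missing from it.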

I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer in org.apache.lucene.analysis, but what is needed to use this
within Nutch? Are there examples of other non-whitespace Tokenizers
being used in Nutch? I notice there is a translation for Thai, but I
couldn't find any Thai-specific segmentation.

Out of the box, Nutch seems able to search space-delimited Myanmar, but
it usually cannot pick out words that are not separated by spaces.

Presumably, I'll need to adapt the code in net.nutch.analysis, but are
there other areas that I need to look at as well? Any tips would be much
appreciated.
Thanks,
Keith Stribley





_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
