I am interested in adding support to Nutch for searching Myanmar language text. Myanmar (Burmese) is often written without spaces between words, so segmenting text into words takes more than splitting on whitespace.
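To illustrate the kind of segmentation involved, here is a minimal sketch of greedy longest-match segmentation against a word list, which is one common approach for scripts written without spaces. It is an assumption for illustration only: it is not a Lucene or Nutch API, the class name LongestMatchSegmenter and the tiny dictionary are hypothetical, and ASCII placeholder words stand in for a real Myanmar lexicon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Nutch.
public class LongestMatchSegmenter {

    /**
     * Splits text into words by repeatedly taking the longest dictionary
     * entry that starts at the current position. Whitespace is skipped,
     * and a character that starts no known word is emitted on its own.
     */
    public static List<String> segment(String text, Set<String> dictionary) {
        // The longest dictionary entry bounds how far ahead we need to look.
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }

        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++;
                continue;
            }
            // Try the longest candidate first, then shrink it one char at a time.
            int end = Math.min(text.length(), pos + maxLen);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // Either a dictionary word or a single-character fallback.
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Placeholder ASCII entries; a real lexicon would hold Myanmar words.
        Set<String> dictionary = new HashSet<>(Arrays.asList("web", "search", "engine"));
        System.out.println(segment("websearch engine", dictionary));
    }
}
```

A real tokenizer would wrap logic like this so that each segment is emitted as a token with its start and end offsets. Greedy matching can also pick the wrong split when words overlap, so this is only a starting point.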
I assume that I need to start by creating a Myanmar Tokenizer and Analyzer in org.apache.lucene.analysis, but what is needed to use them within Nutch? Are there examples of other non-whitespace Tokenizers being used in Nutch? I notice there is a translation for Thai, but I couldn't find any Thai-specific segmentation.

Out of the box, Nutch seems able to search space-delimited Myanmar, but it is usually unable to pick out words that are not delimited by spaces. Presumably I'll need to adapt the code in net.nutch.analysis, but are there other areas that I need to look at as well?

Any tips would be much appreciated.

Thanks,
Keith Stribley

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers