[Nutch-dev] Re: Myanmar Tokeniser

2005-05-31 Thread Andrzej Bialecki

Keith Stribley wrote:

I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) often does not have spaces between
words, so the process of segmenting into words is more difficult than
just whitespace. 


I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer in org.apache.lucene.analysis, but what is needed to use this
within Nutch? Are there examples of other non-whitespace Tokenizers
being used in Nutch? I notice there is a translation for Thai, but I
couldn't find any Thai specific segmentation.

Out of the box, Nutch seems able to search space delimited Myanmar, but
it is usually unable to pick out words without space delimiters. 


Presumably, I'll need to adapt the code in net.nutch.analysis, but are
there other areas that I need to look at as well? Any tips would be much
appreciated.
thanks,


It seems to me that this is a very similar problem to other languages, 
where spaces don't matter so much, like CJK (Chinese, Japanese, Korean). 
There were a few threads of discussion on that, please check the 
archives. There is a bi-gram based tokenizer for such languages, perhaps 
it would work well in your case too.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---
This SF.Net email is sponsored by Yahoo.
Introducing Yahoo! Search Developer Network - Create apps using Yahoo!
Search APIs Find out how you can build Yahoo! directly into your own
Applications - visit http://developer.yahoo.net/?fr=offad-ysdn-ostg-q22005
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


[Nutch-dev] Re: Myanmar Tokeniser

2005-05-31 Thread Ken Krugler

Keith Stribley wrote:

I am interested in adding support to Nutch for searching Myanmar
language text. Myanmar (Burmese) often does not have spaces between
words, so the process of segmenting into words is more difficult than
just whitespace.
I assume that I need to start by creating a Myanmar Tokenizer and
Analyzer in org.apache.lucene.analysis, but what is needed to use this
within Nutch? Are there examples of other non-whitespace Tokenizers
being used in Nutch? I notice there is a translation for Thai, but I
couldn't find any Thai specific segmentation.

Out of the box, Nutch seems able to search space delimited Myanmar, but
it is usually unable to pick out words without space delimiters.
Presumably, I'll need to adapt the code in net.nutch.analysis, but are
there other areas that I need to look at as well? Any tips would be much
appreciated.
thanks,


It seems to me that this is a very similar problem to other 
languages, where spaces don't matter so much, like CJK (Chinese, 
Japanese, Korean). There were a few threads of discussion on that, 
please check the archives. There is a bi-gram based tokenizer for 
such languages, perhaps it would work well in your case too.


Burmese (and Thai) segmentation is a bit different than CJK 
languages. For CJK, since most words are one or two characters, you 
can segment (as far as search goes) by indexing everything as both 
a single character and all possible two-character combinations.
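
As a rough illustration of the single-character plus two-character indexing described above, here is a minimal sketch. It is not the Nutch or Lucene tokenizer; the function name `unigram_bigram_tokens` is made up for this example, and it treats each code point as one "character", which is an assumption that holds reasonably for CJK ideographs but not for scripts that use combining marks.

```python
def unigram_bigram_tokens(text):
    """Emit every single character and every adjacent two-character pair.

    Hypothetical helper for illustration only. Whitespace separates runs,
    and no pair is formed across a whitespace gap.
    """
    tokens = []
    for run in text.split():
        # Single characters first.
        tokens.extend(run)
        # Then all overlapping two-character combinations in the run.
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


if __name__ == "__main__":
    print(unigram_bigram_tokens("中文搜索"))
```

A run of n characters yields n single-character tokens and n - 1 pairs. A query is tokenized the same way, so a two-character word matches wherever that pair occurs in the indexed text, including places where the pair happens to straddle two different words.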


For Burmese and Thai, given the lengths of words, this would result 
in a huge explosion in the number of words being indexed. I'd also be 
worried about false positive problems. But it might be worth a try, 
since true word segmentation is a much harder problem. For Thai, the 
approach I've seen used is to generate a dictionary of the 5-10K most 
common words, then use a greedy (longest match) parse algorithm, 
along with a few simple heuristics to force break positions at key 
character boundaries. This works reasonably well, but to improve the 
accuracy further requires adding grammatical info to the word 
dictionary, a table of bigram grammar pair probabilities (e.g. noun 
followed by verb), etc.
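
The greedy (longest match) parse can be sketched as follows. This is only an illustration under stated assumptions, not code from any Thai or Burmese segmenter: the name `greedy_segment`, the `break_chars` parameter, and the rule "a word may never extend across a character listed in `break_chars`" are all invented here to stand in for the "simple heuristics" mentioned above. The sample dictionary uses Latin letters as a stand-in for real word lists.

```python
def greedy_segment(text, dictionary, break_chars=frozenset()):
    """Split text into words by always taking the longest dictionary match.

    dictionary  -- set of known words (hypothetical; would hold the common words)
    break_chars -- characters that force a break before them (an assumed heuristic)
    Characters that start no dictionary word are emitted one at a time.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Look ahead no further than the longest dictionary word...
        limit = min(len(text), i + max_len)
        # ...and never across whitespace or a forced break position.
        for j in range(i + 1, limit):
            if text[j].isspace() or text[j] in break_chars:
                limit = j
                break
        # Try the longest candidate first, then shorter ones.
        for end in range(limit, i, -1):
            if text[i:end] in dictionary:
                break
        else:
            end = i + 1  # no match: fall back to a single character
        words.append(text[i:end])
        i = end
    return words


if __name__ == "__main__":
    words = {"the", "them", "theme", "me", "men", "man"}
    print(greedy_segment("themen", words))                     # pure greedy parse
    print(greedy_segment("themen", words, break_chars={"m"}))  # with a forced break
```

With this toy dictionary the pure greedy parse takes "theme" and strands the final "n", while forcing a break before "m" gives "the" and "men". That is the kind of error that break heuristics, and beyond them grammatical information, are meant to reduce.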


If you search online I'm guessing you'll find at least some research 
papers written on this. Though maybe you've written such a paper, all 
of the above is old hat to you, and the question is one of how to 
plug your code into Nutch. In that case yes, look at the Chinese 
tokenizer as a starting point.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

