Nutch And Chinese

Jack Tang Mon, 04 Apr 2005 00:00:08 -0700

The current state of Chinese support in Nutch:
1. Chinese word segmentation in the indexing phase is very simple: it
just splits the text one character at a time.
2. There is no Chinese stop-word list.


My goal is to build a more sophisticated NutchAnalysis/FastCharStream
that supports Chinese well.
Here is my idea:
1. Build a Chinese stop-word dictionary.
2. Make the Chinese indexer much smarter.
   A). If there is no dictionary, index automatically using binary
(two-character) segmentation.
   B). Build a dictionary from users' query input, or import an external
dictionary.
   C). If dictionaries are available, refine the index using BMM/FMM
(backward/forward maximum matching).
   D). B) and C) form a closed loop.
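
To make steps A) and C) concrete, here is a minimal sketch in plain Java.
It is not Nutch code: the class name ChineseSegmentSketch, its method
names, and the maxLen parameter are all made up for illustration, and it
assumes "binary-segment" means overlapping two-character tokens. It also
assumes every character fits in one Java char, so surrogate pairs are not
handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
class ChineseSegmentSketch {

    // Step A: no dictionary, so emit overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        if (text.length() == 1) {
            out.add(text); // a lone character is kept as its own token
            return out;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    // Step C, FMM: scan left to right, always taking the longest
    // dictionary word (up to maxLen chars) that starts at the cursor.
    // Falls back to a single character when nothing matches.
    static List<String> fmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(maxLen, text.length() - start);
            while (len > 1 && !dict.contains(text.substring(start, start + len))) {
                len--;
            }
            out.add(text.substring(start, start + len));
            start += len;
        }
        return out;
    }

    // Step C, BMM: same idea, but scan right to left and take the
    // longest dictionary word that ends at the cursor.
    static List<String> bmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(maxLen, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--;
            }
            out.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(out); // tokens were collected back to front
        return out;
    }
}
```

With a dictionary containing 研究, 研究生, 生命 and 起源 and maxLen = 3,
fmm("研究生命起源", ...) gives 研究生 / 命 / 起源 while bmm gives
研究 / 生命 / 起源. Comparing the two results is one simple way to spot
ambiguous spans that the dictionary from step B) may need to resolve.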

Any suggestions? Your ideas are welcome, especially from the Chinese
developers here. If everything is OK, I will file an "improvement" in
JIRA. Thanks.

/Jack
