Chinese support in Nutch today:
1. Chinese word segmentation in the indexing phase is very simple: the text is split one character at a time.
2. There is no Chinese stop-word list.

My goal is to build a more sophisticated NutchAnalysis/FastCharStream that supports Chinese well. Here is my idea:
1. Build a Chinese stop-word dictionary.
2. Make the Chinese indexer much smarter:
   A) If there is no dictionary, index automatically using binary segmentation (overlapping two-character tokens).
   B) Build a dictionary from users' query input, or import an external dictionary.
   C) Once dictionaries are available, refine the index using BMM/FMM (backward/forward maximum matching).
   D) B) and C) form a closed loop.

Any suggestions? Your ideas are welcome, especially from the Chinese developers here. If everything is OK, I will file this as an "improvement" in JIRA.

Thanks
/Jack
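
A rough sketch of steps A) and C), for illustration only. It is not Nutch code and does not touch NutchAnalysis or FastCharStream; the class name ChineseSegmentSketch, both method names, and the maxLen parameter are made up for this example. It assumes the input is a run of Chinese characters from the Basic Multilingual Plane, and it shows only the forward half (FMM) of BMM/FMM. String literals use \u escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChineseSegmentSketch {

    // Step A): binary segmentation. Emit every overlapping pair of
    // adjacent characters. Needs no dictionary at all.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character is kept as its own token
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Step C), forward direction only: at each position take the longest
    // dictionary word of at most maxLen characters; if none matches,
    // fall back to a single character and move on.
    static List<String> forwardMaxMatch(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--; // shrink the candidate until it is a known word
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical two-word dictionary: "zhongwen" (Chinese), "fenci" (segmentation)
        Set<String> dict = new HashSet<>(Arrays.asList("\u4e2d\u6587", "\u5206\u8bcd"));
        String text = "\u4e2d\u6587\u5206\u8bcd"; // "zhongwen fenci"

        System.out.println(bigrams(text));                  // three overlapping pairs
        System.out.println(forwardMaxMatch(text, dict, 4)); // two dictionary words
    }
}
```

With the dictionary in place, the four-character input yields two tokens instead of three overlapping pairs, which is the kind of refinement step C) is after. A backward pass (BMM) would scan from the end of the text in the same way.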
