Current state of Chinese support in Nutch:

1. Chinese word segmentation in the indexing phase is very simple: the text is just split one character at a time.
2. There is no Chinese stop-word list.

My goal is to build a more sophisticated NutchAnalysis/FastCharStream that supports Chinese well. Here is my idea:

1. Build a Chinese stop-word dictionary.
2. Make the Chinese indexer much smarter:
   A) If there is no dictionary, index automatically using binary segmentation (overlapping two-character tokens).
   B) Build a dictionary from users' query input, or import an external dictionary.
   C) If one or more dictionaries are available, refine the index using BMM/FMM (backward/forward maximum matching).
   D) B) and C) form a closed loop.

Any suggestions? Ideas are welcome, especially from the Chinese developers here. If everything is OK, I will file this as an "improvement" in JIRA.

Thanks
/Jack

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
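
P.S. To make steps A) and C) of the idea above a bit more concrete, here is a rough sketch of binary segmentation and of forward/backward maximum matching. It is only an illustration under my own assumptions: the class ChineseSegmentSketch, its method names, the maxLen parameter and the tiny dictionary are all made up, nothing here is existing Nutch code, and it assumes every character fits in a single Java char. The sample text is 研究生命起源, written with \u escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Nutch. */
public class ChineseSegmentSketch {

    /** Step A: with no dictionary, emit overlapping two-character tokens. */
    public static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        if (text.length() == 1) {
            out.add(text); // a lone character is kept as its own token
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    /** Step C, FMM: scan left to right, taking the longest dictionary word each time. */
    public static List<String> fmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(maxLen, text.length() - start);
            // Shrink the window until it matches; one character is the fallback.
            while (len > 1 && !dict.contains(text.substring(start, start + len))) {
                len--;
            }
            out.add(text.substring(start, start + len));
            start += len;
        }
        return out;
    }

    /** Step C, BMM: the same idea, scanning right to left. */
    public static List<String> bmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(maxLen, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--;
            }
            out.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(out); // tokens were collected back to front
        return out;
    }

    public static void main(String[] args) {
        // Toy dictionary: yanjiu, yanjiusheng, shengming, qiyuan.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u7814\u7A76", "\u7814\u7A76\u751F", "\u751F\u547D", "\u8D77\u6E90"));
        String text = "\u7814\u7A76\u751F\u547D\u8D77\u6E90";
        System.out.println(bigrams(text));
        System.out.println(fmm(text, dict, 3));
        System.out.println(bmm(text, dict, 3));
    }
}
```

With this toy dictionary, FMM yields 研究生 / 命 / 起源 while BMM yields 研究 / 生命 / 起源. The two directions can disagree on the same text, which is one reason to consider running both and comparing the results when refining the index.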
