Current state of Chinese support in Nutch:
1. Chinese word segmentation in the indexing phase is very simple: the
text is split one character at a time.
2. There is no Chinese stop-word list.

My goal is to build a more sophisticated
NutchAnalysis/FastCharStream that supports Chinese well.
Here is my idea:
1. Build a Chinese stop-word dictionary.
2. Make the Chinese indexer much smarter.
   A). If there is no dictionary, index automatically using binary
segmentation (overlapping two-character tokens).
   B). Build a dictionary from users' query input, or import an external
dictionary.
   C). If dictionaries are available, refine the index using BMM/FMM
(backward/forward maximum matching).
   D). B) and C) form a closed loop.
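
To make A) and C) concrete, here is a rough sketch and not a proposed
patch. It assumes "binary-segment" means overlapping two-character tokens
and that BMM/FMM mean backward/forward maximum matching against a word
set. The class name SegmentSketch, its method names, and the maxLen
parameter are all hypothetical and are not existing Nutch code. It also
ignores characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class SegmentSketch {

    /** Binary segmentation: emit every overlapping pair of adjacent chars. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        if (text.length() == 1) {
            out.add(text); // a lone character is kept as its own token
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    /** FMM: scan left to right, taking the longest dictionary word each time. */
    static List<String> fmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the window until it is a known word or a single char.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    /** BMM: same idea, but scan right to left. */
    static List<String> bmm(String text, Set<String> dict, int maxLen) {
        LinkedList<String> out = new LinkedList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLen);
            // Shrink the window from the left until it matches or is one char.
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            out.addFirst(text.substring(start, end));
            end = start;
        }
        return out;
    }
}
```

For example, with the dictionary {研究, 研究生, 生命, 起源} and maxLen = 3,
the text 研究生命起源 comes out of fmm as 研究生 / 命 / 起源 but out of bmm
as 研究 / 生命 / 起源. Disagreements like this are one place where the
dictionary feedback in B) could help decide which segmentation to keep.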

Any suggestions? Ideas are welcome, especially from the Chinese
developers here. If everything looks OK, I will file this as an
"improvement" in JIRA. Thanks.

/Jack


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
