The Nutch language identifier plugin currently doesn't handle CJKV pages. Does anybody here have any experience with automatically detecting the language of such pages?
I know there are specific encodings which give away what language the page is in, but for Asian-language pages that use Unicode or one of its encodings, I'm out of luck.

Andy

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
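
To make the encoding-based shortcut concrete, here is a minimal sketch of a charset-to-language lookup. The class name `CharsetLanguageHint` and the charset list are assumptions made for illustration; this is not part of the Nutch plugin or its API, and the list is not exhaustive.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Nutch: maps a declared charset label
// to the language that charset usually implies.
public class CharsetLanguageHint {

    // Hand-picked legacy CJKV charsets (illustrative, not exhaustive).
    private static final Map<String, String> HINTS = Map.of(
        "shift_jis", "ja",
        "euc-jp", "ja",
        "iso-2022-jp", "ja",
        "euc-kr", "ko",
        "gb2312", "zh",
        "gbk", "zh",
        "gb18030", "zh",
        "big5", "zh",
        "windows-1258", "vi"
    );

    // Returns an ISO 639-1 language code, or empty when the charset
    // carries no language hint (for example UTF-8 or UTF-16).
    public static Optional<String> guess(String charsetLabel) {
        if (charsetLabel == null) {
            return Optional.empty();
        }
        String key = charsetLabel.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(HINTS.get(key));
    }

    public static void main(String[] args) {
        System.out.println(guess("Shift_JIS"));
        System.out.println(guess("UTF-8"));
    }
}
```

The lookup returns an empty result for Unicode encodings, which is exactly the case the question is about: there the charset says nothing, and the language has to be inferred from the text itself.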
