At 06:43 PM 4/5/2002 -0800, you wrote:
>I'm working on a multi language spider, and I've come to a point where I'm
>not sure what assumption to make.

<BIG SNIP>

The solution to your problem is to use a language identifier. A language identifier can recognize not only the language of a text but also the character set in use. So all you need to do is download the page and throw it at a language identifier, and it will tell you what language and character set it is. Or you could do it a paragraph at a time, just in case you are dealing with a mixed-language document.

It just so happens we market one. ;-) It supports ~230 languages in a variety of different character sets, in addition to UTF-8 and Unicode (big- and little-endian). You can play with a simple demo at:

http://www.languageidentifier.com/

(Though Chinese isn't included in the demo.)

We developed it originally to assist with language-specific crawling, among other things.

Interestingly enough, we are finishing up work on a Chinese text segmentation system. (This puts spaces into Chinese text so that you can index and search it more efficiently.)

If interested, please contact me at: [EMAIL PROTECTED]

-Art

--
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.
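
For illustration, the paragraph-at-a-time approach described above could be wired into a spider roughly as follows. This is only a minimal Python sketch, not how any particular product works: the function names, the `identify` callback (a stand-in for whatever language identifier you plug in), and the `latin-1` fallback are all assumptions made for the example. It only sniffs the Unicode encodings (UTF-8 and UTF-16 in either byte order); detecting legacy character sets is left to the identifier itself.

```python
import codecs
import re


def sniff_unicode_encoding(raw: bytes):
    """Return a codec name if the bytes look like UTF-8 or UTF-16, else None."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if raw.startswith(codecs.BOM_UTF16_BE) or raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    try:
        raw.decode("utf-8")  # strict decoding: any invalid sequence raises
        return "utf-8"
    except UnicodeDecodeError:
        return None


def split_paragraphs(text: str):
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def identify_by_paragraph(raw: bytes, identify, fallback: str = "latin-1"):
    """Decode a page and label each paragraph separately.

    `identify` is a hypothetical callback (text -> language label).
    `fallback` is an assumed codec for pages that are not Unicode.
    """
    encoding = sniff_unicode_encoding(raw) or fallback
    text = raw.decode(encoding)
    return [(identify(p), p) for p in split_paragraphs(text)]
```

Labelling each paragraph on its own means a page that mixes, say, English and French comes back with two different labels instead of one guess for the whole document.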