I'm working on a multi-language spider, and I've come to a point where I'm not sure what assumptions to make. If I download a page from a Chinese (for example) server, there is a chance that:
1. The HTTP header says a Chinese charset, but the page is in English.
2. The HTTP header says nothing; the meta tag says a Chinese charset, but the page is in English.
3. The HTTP header says nothing; the meta tag says the Chinese language; the page is in Chinese.
4. The HTTP header says nothing; the page is in Chinese.

I'm not sure how to handle these situations. #4 seems like a no-win: I can't tell what charset (or language) to use. In #3, I would use the default charset, which is wrong (at least on my US-based system). I don't know of any way to figure out the charset, unless it's by the country of the server (not always accurate), and even then, a language might use multiple character sets.

My question is: which do you believe, the HTTP header, the content-type, the content-language, or the text itself? And if they disagree, which should I trust more? Or have these problems been around long enough that the vast majority of HTML pages are marked correctly?

Thanks,
Erick
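For what it's worth, here is a minimal sketch (in Python, stdlib only) of one common precedence order for cases #1-#3: trust the HTTP Content-Type header first, then a <meta> charset sniffed from the start of the body, then fall back to a default. The function name and the 2048-byte sniff window are my own choices, not anything standard; case #4 would still need statistical detection (chardet-style byte-frequency analysis) on top of this.

```python
import re

def resolve_charset(content_type_header, body_bytes, default="utf-8"):
    """Pick a charset using a common precedence:
    HTTP Content-Type header, then the HTML <meta> tag, then a default."""
    # 1. charset= parameter in the HTTP Content-Type header, if present
    if content_type_header:
        m = re.search(r"charset=([\w-]+)", content_type_header, re.I)
        if m:
            return m.group(1).lower()
    # 2. <meta ... charset=...> sniffed from the first part of the body;
    #    decode as ASCII-with-errors-ignored since we don't know the charset yet
    head = body_bytes[:2048].decode("ascii", errors="ignore")
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # 3. fall back to a default (HTTP/1.1 nominally implies ISO-8859-1 for
    #    text/* with no charset; UTF-8 is the safer assumption today)
    return default

print(resolve_charset("text/html; charset=GB2312", b""))
print(resolve_charset(None, b'<meta charset="big5"><html></html>'))
print(resolve_charset(None, b"<html></html>"))
```

Note this only resolves the *declared* charset; it can't catch a header or meta tag that lies about the actual bytes, which is exactly the disagreement problem above.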
