I'm working on a multi-language spider, and I've come to a point where I'm
not sure which assumptions to make. If I download a page from a Chinese (for
example) server, there is a chance that

1. the HTTP header declares a Chinese charset, but the page is in English
2. the HTTP header says nothing, a meta tag declares a Chinese charset, but
the page is in English
3. the HTTP header says nothing, a meta tag declares Chinese as the
language, and the page is in Chinese
4. the HTTP header says nothing, and the page is in Chinese

I'm not sure how to handle these situations. #4 seems like a no-win: I can't
tell what charset (or language) to use. In #3, I would fall back to my
system's default charset, which is wrong (at least on my US-based system). I
don't know of any way to figure out the charset, unless it's by the server's
country (not always accurate), and even then a single language might have
multiple character sets. My question is: which do you believe, the HTTP
header, the meta content-type, the meta content-language, or the text
itself? And when they disagree, which should be trusted more?
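For what it's worth, one common compromise (not something established here, just a sketch) is a precedence chain: trust the HTTP header's charset if present, then a charset declared in an early meta tag, and only then sniff the bytes themselves. The candidate-encoding list below is illustrative, not exhaustive:

```python
import re

# Illustrative candidates for the byte-sniffing fallback; a real spider
# would tune this list (and its order) to the sites it crawls.
CANDIDATES = ('utf-8', 'gb18030', 'big5', 'shift_jis', 'euc-kr', 'latin-1')

def guess_charset(content_type_header, body_bytes):
    """Best-effort charset guess: HTTP header > meta tag > byte sniffing."""
    # 1. Trust the charset parameter of the HTTP Content-Type header.
    if content_type_header:
        m = re.search(r'charset=([\w.-]+)', content_type_header, re.I)
        if m:
            return m.group(1).lower()
    # 2. Fall back to a charset declared in a meta tag. Most declarations
    #    appear early, so scan only the ASCII-safe view of the first 2 KB.
    head = body_bytes[:2048].decode('ascii', errors='ignore')
    m = re.search(r'charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # 3. Last resort: return the first candidate that decodes cleanly.
    #    latin-1 accepts any byte sequence, so it acts as the final catch-all.
    for enc in CANDIDATES:
        try:
            body_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return 'latin-1'
```

Note that step 3 can only tell you the page *decodes* under an encoding, not that the guess is right; that ambiguity is exactly the problem with case #4 above.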

Or have these problems been around long enough that the vast majority of
HTML pages are now marked correctly?

Thanks,
Erick
