OK, here's the algorithm I've come up with.

If an understandable charset is specified in an HTTP header for that
page, use that.

Otherwise:  If the page is text/html, and specifies the charset
internally with a <META> tag, use that.

Otherwise:  If the user has explicitly specified a charset, either in
the .pluckerrc or .ini file, or on the command line, use that charset.

Otherwise:  If the page's URL starts with 'http:' or 'https:', use the
HTTP default charset of ISO-8859-1.

Otherwise:  If a locale-specific charset (obtained by using the Python
locale module) is both specified and understandable, use that.  Note that
this mainly affects file: and plucker: URLs, for which it seems just right.

Otherwise:  Specify that the charset for the page is unknown.
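
The fallback order above could be sketched in Python roughly as follows. This is only an illustration, not the actual Plucker code: the names understandable(), charset_from_header() and choose_charset() and their parameters are hypothetical, "understandable" is assumed to mean "Python has a codec for it", and locale.getpreferredencoding() is assumed as the way to ask the locale module for a charset.

```python
import codecs
import locale
import re

# Assumed pattern for a charset declared inside a <META> tag.
META_CHARSET = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)',
                          re.IGNORECASE)


def understandable(charset):
    """Return charset if Python has a codec for it, else None (assumption)."""
    if not charset:
        return None
    try:
        codecs.lookup(charset)
    except LookupError:
        return None
    return charset


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type,
                      re.IGNORECASE)
    return match.group(1) if match else None


def choose_charset(url, content_type_header=None, mime_type=None, body="",
                   user_charset=None, locale_charset=None):
    """Return a charset name, or None if the charset is unknown."""
    # 1. An understandable charset from the HTTP header wins.
    charset = understandable(charset_from_header(content_type_header))
    if charset:
        return charset
    # 2. A charset declared in a <META> tag of a text/html page.
    #    (The steps above only require "understandable" in steps 1 and 5.)
    if mime_type == "text/html":
        match = META_CHARSET.search(body)
        if match:
            return match.group(1)
    # 3. A charset the user gave explicitly (config file or command line).
    if user_charset:
        return user_charset
    # 4. The HTTP default for http: and https: URLs.
    if url.lower().startswith(("http:", "https:")):
        return "ISO-8859-1"
    # 5. The locale's charset, if specified and understandable.
    if locale_charset is None:
        locale_charset = locale.getpreferredencoding(False)
    charset = understandable(locale_charset)
    if charset:
        return charset
    # 6. Give up: the charset is unknown.
    return None
```

With this sketch, choose_charset("https://example.com/") falls through to step 4 and returns "ISO-8859-1", while a file: or plucker: URL with no other information ends up using whatever the locale reports.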

Bill
