On Fri, 5 Feb 2010 22:05:27 +0100
Gisle Aas <gi...@aas.no> wrote:

> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> specifies how to pre-scan an HTML document to sniff the charset.
> Would it not be simpler to just implement the algorithm as specified
> instead of using a generic parser.  The use of HTML::Parser to
> implement this sniffing was just me trying a shortcut since
> HTML::Parser seemed to implement a superset of these rules.

Those rules look somewhat involved to me, especially knowing that we already
have both XS and Pure Perl parsers at hand. 

Two thoughts:

1. What about using HTML::Encoding, after adapting it so it has only a
conditional dependency on HTML::Parser, and uses HTML::Parser only when it's
available?
(It already tries several detection methods before falling back to HTML::Parser):

http://search.cpan.org/~bjoern/HTML-Encoding/

A variation on this idea would be for *it* to use a pure Perl HTML parser
instead of skipping the HTML-parsing check completely. 
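For what it's worth, the core of the spec's pre-scan (BOM check, then a bounded
scan of the first bytes for a meta charset declaration) doesn't need a full
parser. Here's a rough pure-Perl sketch of that idea; the function name
sniff_charset is made up, and this is a simplification of the real algorithm
(it doesn't handle http-equiv pragmas, comments, or attribute edge cases):

```perl
# Hypothetical sketch of a charset pre-scan, not the full HTML5 algorithm.
# Looks at the first 1024 bytes for a BOM, then for a <meta charset=...>.
sub sniff_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);

    # Byte-order marks win outright.
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;

    # Naive scan for <meta ... charset=...> in the prefix.
    if ($head =~ /<meta[^>]+charset\s*=\s*["']?\s*([\w.:-]+)/i) {
        return $1;
    }
    return undef;    # caller falls back to other detection methods
}
```

Something this size might be easier to maintain than dragging in a parser
dependency just for the sniffing step.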

2. I note this from the spec page you reference:

"This algorithm is a willful violation of the HTTP specification, which 
requires that the encoding be assumed to be ISO-8859-1 in the absence of a 
character encoding declaration to the contrary, and of RFC 2046, which requires 
that the encoding be assumed to be US-ASCII in the absence of a character 
encoding declaration to the contrary. This specification's third approach is 
motivated by a desire to be maximally compatible with legacy content. [HTTP] 
[RFC2046]"

According to this, we could skip all this encoding-detection work and still be 
HTTP-spec compliant (although it might be more user-friendly to keep trying to 
guess).

    Mark


-- 
http://mark.stosberg.com/