On Fri, 5 Feb 2010 22:05:27 +0100 Gisle Aas <gi...@aas.no> wrote:

> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> specifies how to pre-scan an HTML document to sniff the charset.
> Would it not be simpler to just implement the algorithm as specified
> instead of using a generic parser. The use of HTML::Parser to
> implement this sniffing was just me trying a shortcut since
> HTML::Parser seemed to implement a superset of these rules.
Those rules look somewhat involved to me, especially knowing that we
already have both XS and pure-Perl parsers at hand. Two thoughts:

1. What about using HTML::Encoding, after adapting it so it has only a
   conditional dependency on HTML::Parser, and only uses HTML::Parser
   if available? (It already tries several detection methods before
   getting to HTML::Parser.)

   http://search.cpan.org/~bjoern/HTML-Encoding/

   A variation on this idea would be for *it* to use a pure-Perl HTML
   parser instead of skipping the HTML-parsing check completely. (See
   the rough sketch in the P.S. below.)

2. I note this from the spec page you reference:

   "This algorithm is a willful violation of the HTTP specification,
   which requires that the encoding be assumed to be ISO-8859-1 in the
   absence of a character encoding declaration to the contrary, and of
   RFC 2046, which requires that the encoding be assumed to be US-ASCII
   in the absence of a character encoding declaration to the contrary.
   This specification's third approach is motivated by a desire to be
   maximally compatible with legacy content. [HTTP] [RFC2046]"

   According to this, we can skip all this encoding-detection work and
   still be HTTP-spec compliant (although it might be more
   user-friendly to keep trying to guess).

Mark

-- 
http://mark.stosberg.com/
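
P.S. Here's a rough sketch of what the conditional dependency in idea
(1) could look like. The helper names (charset_from_html_parser,
prescan_meta_charset) are hypothetical placeholders, not
HTML::Encoding's actual API:

    use strict;
    use warnings;

    # True only if HTML::Parser is actually installed.
    my $HAVE_HTML_PARSER = eval { require HTML::Parser; 1 } ? 1 : 0;

    sub sniff_charset {
        my ($octets) = @_;

        # Byte-order marks are unambiguous, so check them first.
        return 'UTF-8'    if $octets =~ /^\xEF\xBB\xBF/;
        return 'UTF-16BE' if $octets =~ /^\xFE\xFF/;
        return 'UTF-16LE' if $octets =~ /^\xFF\xFE/;

        if ($HAVE_HTML_PARSER) {
            # Full tokenizer pass, as the module does today.
            return charset_from_html_parser($octets);
        }
        # Pure-Perl fallback; see below.
        return prescan_meta_charset($octets);
    }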
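
The fallback itself could be a deliberately rough approximation of the
spec's pre-scan: look in the first 1024 bytes for "charset=" inside a
<meta> tag, which covers both <meta charset="..."> and the http-equiv
Content-Type form. A sketch, not the full algorithm:

    # Hypothetical pure-Perl fallback, limited to the <meta> cases.
    sub prescan_meta_charset {
        my ($octets) = @_;

        # The spec's pre-scan only examines the first 1024 bytes.
        my $head = substr($octets, 0, 1024);

        # Matches charset= anywhere inside a <meta ...> tag, so it
        # catches both <meta charset="utf-8"> and
        # <meta http-equiv="Content-Type" content="...; charset=...">.
        if ($head =~ /<meta[^>]*?\bcharset\s*=\s*["']?\s*([\w.:-]+)/i) {
            return $1;
        }
        return undef;    # nothing found; caller applies a default
    }

Even this omits pieces of the real algorithm (comment handling, the
spec's proper attribute tokenizing, and so on), which is part of why
reusing HTML::Encoding's existing checks still looks attractive to me.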