I now have working code published that allows HTTP::Message to work without depending on HTML::Parser. This is useful because it is a step towards splitting some of the HTTP modules out into their own distribution, one that does not have this dependency. HTML::Parser in turn requires a C compiler, so this project could help parts of LWP be used in places where a C compiler is not available, or where it would be more convenient to distribute one code line that can be used directly on multiple architectures. (This is not the only use of HTML::Parser by the distribution, though: LWP::UserAgent makes use of HTML::HeadParser, which in turn uses HTML::Parser.)

My code is here:

http://github.com/markstos/libwww-perl/tree/remove-html-parser-dependency

The solution passes the numerous existing tests for charset detection, as well as a new one I added. However, I'm not yet recommending that the work be merged, because the approach is not clean. Essentially, I have embedded a fairly full-featured Pure Perl HTML parser into HTTP::Message. :) The code was taken from my fork of the "HTML::Parser::Simple" project and specialized somewhat for this case:

http://github.com/markstos/html--parser--simple

I think a cleaner approach would be to publish this Pure Perl HTML parser, and then have an option to use it if HTML::Parser is not available.

A little history about HTML::Parser::Simple: Ron Savage created the project based on the htmlparser.js JavaScript parser by John Resig. That project was not trying to be particularly compatible with anything. It defines a new API. In particular, it bundles a parse tree *consumer*, Tree::Simple, as well as a parse tree producer.

I forked the project and made some incompatible changes to pursue a different goal: create a pure Perl HTML parser that is compatible with the HTML::Parser API. More specifically, I wanted to emulate the HTML::Parser 2.x API well enough that my parser could be used in place of it with HTML::FillInForm. My work met that goal: with my parser in place, the HTML::FillInForm test suite passes apart from some minor failures that I don't think matter. This new case of parsing meta tags is another specialized use of the parser, which gives me another reason to publish the work.

Here's the problem: while I care about these specific goals for an HTML::Parser that is "compatible enough", I'm not really interested in personally pursuing the idea of a Pure Perl HTML::Parser that is 100% compatible with HTML::Parser just for the sake of it. In short, I'm sure there will be change requests that go beyond the uses I care about, and I'm not interested in maintaining the module to extend it for those other uses.
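
To make the "use it if HTML::Parser is not available" idea a bit more concrete, here is a minimal sketch of how such a fallback could be wired up. It is not what my branch currently does. The helper name first_loadable and the module name My::PurePerl::HTMLParser are hypothetical placeholders, since the pure Perl parser has no published name yet.

```perl
use strict;
use warnings;

# Return the first class in the list that can be loaded, or undef if
# none of them can. (Hypothetical helper, for illustration only.)
sub first_loadable {
    my @candidates = @_;
    for my $class (@candidates) {
        # Turn Foo::Bar into Foo/Bar.pm so require() can take it at runtime.
        (my $file = "$class.pm") =~ s{::}{/}g;
        return $class if eval { require $file; 1 };
    }
    return;
}

# Prefer the XS-based HTML::Parser; fall back to a pure Perl parser.
# 'My::PurePerl::HTMLParser' is a made-up name standing in for
# whatever the published pure Perl module would be called.
my $parser_class = first_loadable('HTML::Parser', 'My::PurePerl::HTMLParser');

if (defined $parser_class) {
    print "Using $parser_class for <meta> charset detection\n";
}
else {
    print "No HTML parser available; skipping <meta> charset detection\n";
}
```

With something like this, machines that have a C compiler keep using HTML::Parser, and only those without it would ever load the pure Perl code. It assumes both parsers expose a compatible enough API for the caller, which is exactly the open question.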

There's also the matter of what to name it, since a version of HTML::Parser::Simple already exists. There's always HTML::Parser::PP or HTML::Parser::PurePerl, but those names just invite the idea that the goal is to be 100% compatible with HTML::Parser.

I'll discuss the matter further with Ron Savage to get his thoughts. Feedback on the topic from other LWP users is welcome.

Mark

--
http://mark.stosberg.com/