I have published working code that allows HTTP::Message to work without
the dependency on HTML::Parser. This is useful because it's a step towards
splitting some of the HTTP modules out into their own distribution that does
not have this dependency, which in turn requires a C compiler. So, this
project could help parts of LWP be used in places where a C compiler
is not available, or where it would be more convenient to distribute one code
line that can be used directly on multiple architectures.
(But this is not the only use of HTML::Parser by the distribution.
LWP::UserAgent makes use of HTML::HeadParser, which in turn uses HTML::Parser.)

My code is here:

http://github.com/markstos/libwww-perl/tree/remove-html-parser-dependency

The solution passes the numerous existing tests for charset detection, as well
as a new one I added.

However, I'm not yet recommending that the work be merged because the approach
is not clean.

Essentially I have embedded a fairly full-featured Pure Perl HTML parser
into HTTP::Message. :) The code was taken from my fork of the
"HTML::Parser::Simple" project and specialized somewhat for this case:

 http://github.com/markstos/html--parser--simple

I think a cleaner approach would be to publish this Pure Perl HTML parser, and
then have an option to use it if HTML::Parser is not available.
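
To make the "use it if HTML::Parser is not available" idea concrete, here is a
minimal sketch of a fallback loader. It is not code from my branch. The helper
name pick_parser_class and the pure-Perl class name in the usage comment are
hypothetical, since the module has not been published or named yet.

```perl
use strict;
use warnings;

# Hypothetical helper: try each candidate class in order and return the
# name of the first one that loads, or nothing if none can be loaded.
sub pick_parser_class {
    my @candidates = @_;
    for my $class (@candidates) {
        # Turn Foo::Bar into Foo/Bar.pm so require() treats it as a file.
        (my $file = "$class.pm") =~ s{::}{/}g;
        return $class if eval { require $file; 1 };
    }
    return;
}

# Usage: prefer the XS parser, fall back to a pure-Perl one.
# 'My::PurePerl::HTMLParser' is a placeholder name, not a real module.
# my $class = pick_parser_class('HTML::Parser', 'My::PurePerl::HTMLParser')
#     or die "No HTML parser available";
```

The caller would still need both classes to accept the same handler API for
this to be a drop-in choice.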

A little history about HTML::Parser::Simple:

Ron Savage created the project based on the htmlparser.js JavaScript parser by
John Resig.  This branch was not trying to be particularly compatible with
anything. It defines a new API.  In particular, it bundles a parse tree
*consumer*, Tree::Simple, as well as a parse tree producer.

I forked the project and made some incompatible changes to pursue a different
goal: Create a pure Perl HTML parser that is compatible with the HTML::Parser
API. Or specifically, I wanted to emulate the HTML::Parser 2.x API enough so
that my parser could be used in place of it with HTML::FillInForm. My work met
that goal: with it, the HTML::FillInForm test suite passes apart from some
minor failures that I don't think matter.

This new case of parsing meta tags is another specialized use of the parser
that gives me another reason to publish the work.
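
For readers who haven't looked at what the charset detection has to find, here
is a deliberately naive sketch. It is not the embedded parser from my branch,
and the function name sniff_meta_charset is hypothetical. It uses a plain regex
where the real code uses a parser, so it will be fooled by meta tags inside
comments or scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercased charset named in the first
# <meta> tag that declares one, or nothing if no such tag is found.
sub sniff_meta_charset {
    my ($html) = @_;
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        # Covers both <meta charset="..."> and
        # <meta http-equiv="Content-Type" content="text/html; charset=...">,
        # since "charset=VALUE" appears inside the tag in either form.
        return lc $1 if $attrs =~ /\bcharset\s*=\s*["']?([\w.:-]+)/i;
    }
    return;
}
```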

Here's the problem: While I care about these specific goals for an HTML::Parser
that is "compatible enough", I'm not really interested in personally pursuing
the idea of a Pure Perl HTML::Parser that is 100% compatible with HTML::Parser
just for the sake of it.  In short, I'm sure there will be change requests
beyond the uses I care about, and I'm not interested in maintaining the
module to extend it for other uses.

There's also the matter of what to name it, since a version of
HTML::Parser::Simple already exists. There's always HTML::Parser::PP or
HTML::Parser::PurePerl, but those names just invite the idea that the goal is
to be 100% compatible with HTML::Parser.

I'll discuss the matter further with Ron Savage to get his thoughts.

Feedback on the topic from other LWP users is welcome.

    Mark

-- 
http://mark.stosberg.com/


