On Sat, 5 Mar 2005, Robert Barta wrote:
> I am using WWW::Mechanize/LWP and some of their subclasses now for
> several things and I see an architectural problem I will be facing in
> some future:
>
> For downstream developers (and for me) I need to offer a facility
> to choose a user agent which supports a number of features:
>
>  - local caching
>
>  - specialized cookie handling for specific web sites
>
>  - scripting (controlling the user agent via a dedicate language
>    and not via Perl method calls to WWW::Mechanize).
>
>  - triggering of application specific code at particular events
>    (page loaded, link selection, page unload)
>
>  - maybe optional JavaScript/DOM coverage later
>
> Now much of this functionality is already there (I have implemented
> scripting recently), but somehow spread over several packages in
> incompatible ways. But for a downstream developer it is not possible
> to say something like this:
>
>   my $ua = new LWP::UserAgent::Pluggable;
>
>   $ua->add_plugin (new LWP::UserAgent::Plugins::Cache (size => '4M'));
>   $ua->add_plugin (new LWP::UserAgent::Plugins::Scriptable (plan => ...));
>   $ua->add_plugin (new LWP::UserAgent::Plugins::Hooks (
>              ('http://specialsite/page' => sub { do something; }));
>
> Does this make sense?

Yes! Python's urllib2 works like this, so looking at it is well worth
the time if you want something similar in Perl. I extended it in a
fairly simple way in Python 2.4, and it now works quite nicely to
support all kinds of things (cookies, auth (various flavours), http,
ftp, gopher etc., refresh handling, referer handling, http-equiv,
redirection, seek()-able responses, robots.txt observance...) using a
single, relatively simple, plugin handler interface. Caching (of both
content and connections) would naturally and easily fit into that.

I recently noticed that the yum package manager / urlgrabber developers
have added more features (what I assume are decent implementations of
throttling, persistent connections, mirror selection, etc.), as far as
I can tell mostly using the same plugin handler system (though they're
pretty application-focused).

There's no requirement to shoehorn everything into some elegant scheme
in order to enable customisation and re-use, though, is there? Module
designs need effort expended to keep them open and reusable, true, but
that doesn't mean (mythical) perfect genericity (although really
generic interfaces can sometimes be just the ticket and very useful, as
with urllib2's handlers).

A few examples of where, despite urllib2's rather nice handlers, I
don't feel a need to fit into any grand generic interface:

For cookie policy, I have (in ClientCookie, and now cookielib in the
Python stdlib) CookiePolicy objects -- *not* a handler -- rather, each
cookie handler *has* a CookieJar, which *has* a CookiePolicy.

Hooks as you describe might well be done best with explicit support
from standard handlers, I would guess (though I wouldn't know for sure
until I try). Mind you, I have a couple of useful debug handlers, eg.
for printing redirected response bodies.

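To make the handler idea a bit more concrete, here is a minimal sketch
of a hook plugin written in the urllib2 handler style. It is only an
illustration under some assumptions: it targets Python 3's
urllib.request (the descendant of urllib2) so that it runs as-is, and
HookProcessor, CannedHTTPHandler and build_demo_opener are hypothetical
names made up for this example, not part of any library. The canned
handler stands in for the real HTTP handler so nothing touches the
network.

```python
import io
import urllib.request
import urllib.response
from email.message import Message


class CannedHTTPHandler(urllib.request.BaseHandler):
    """Hypothetical stand-in for the real HTTP handler: no network access."""

    def http_open(self, req):
        # Any handler method named <scheme>_open is asked to fetch the URL.
        headers = Message()
        headers["Content-Type"] = "text/plain"
        body = io.BytesIO(b"canned page")
        return urllib.response.addinfourl(body, headers, req.full_url, 200)


class HookProcessor(urllib.request.BaseHandler):
    """Hypothetical plugin: run application code when a given URL is loaded."""

    def __init__(self, hooks):
        # Maps URL -> callable(request, response); an assumed convention.
        self.hooks = dict(hooks)

    def http_response(self, request, response):
        # Any handler method named <scheme>_response sees each response.
        callback = self.hooks.get(request.full_url)
        if callback is not None:
            callback(request, response)
        return response  # a processor must hand the response on


def build_demo_opener(hooks):
    # A bare OpenerDirector has no default handlers, so only ours run.
    opener = urllib.request.OpenerDirector()
    opener.add_handler(CannedHTTPHandler())
    opener.add_handler(HookProcessor(hooks))
    return opener


loaded = []
opener = build_demo_opener(
    {"http://specialsite/page": lambda req, resp: loaded.append(resp.url)}
)
opener.open("http://specialsite/page").read()   # hook fires
opener.open("http://specialsite/other").read()  # no hook registered
print(loaded)
```

The point of the sketch is that the opener finds plugins purely by
method name (http_open, http_response), so a hook plugin needs no
special support from the opener itself; whether that is enough for
events like link selection or page unload is a separate question.
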
I've never tried scripting, but I don't see any obvious reason for
wanting that as a plugin handler in the urllib2 sense (FWIW, I've never
looked at it, but I know there's a scripting system called PBP, based
on urllib2 + my libraries, which are in turn based in large part on
ports from LWP). I've not considered more elaborate generic plugin
systems that might offer the opportunity for having eg. this kind of
scripting as a plugin to some browser object (too much else more
valuable I could do first!), but maybe that'd be an interesting idea to
think about a bit.

In my port of WWW::Mechanize, I added simple methods back on top of the
urllib2 handler system, mostly for the convenience of *removing*
handlers without rebuilding an opener object each time (eg.
Browser.handle_refresh(handle) -- where handle is a boolean arg). Works
fairly nicely, I think.

I also started on JavaScript support. You need a browser model for that
(same goes for proper Referer handling, though eg. my
mechanize.HTTPRefererProcessor is written as an object that works just
like any other handler -- it just happens to use a Browser class in its
implementation), so the sort of handlers I refer to above aren't the
main issue. See DOMForm and python-spidermonkey here:

http://wwwsearch.sourceforge.net/

Enough rambling. Hope this helps stir you to write something
interesting and share it...

John