There are some pages I really like, but they have far too much crap on
them (links to other sections, back-links to "home" page, headers,
sidebars, etc. etc. ad nauseam - all very pretty on my PC browser, but
I have to jump down 50-60% to get to the real stuff.)
http://dailynews.yahoo.com/headlines/human_interest/oddly_enough/
is a case in point. Because they insert the date in the path, use
of "stayonhost" and "staybelow=" doesn't help (say, can you wildcard
part of staybelow? - that'd help a lot! - looking at STAYBELOW
processing in spider.py does not encourage me to think so, but I
hope to be told I'm wrong.)
I would like to pre-download a bunch of pages, filter their content, and
then pass them to the plucker spider/parser/compressor. I would tailor
the filter for each one (each is different), but the first step eludes me.
Can anyone out there point me to a tool that I can call from a script
in MSWindows to download an html page? Something that would look like
webget http://www.foo.com/bar.html > xyzzy.html
trimcrap xyzzy.html > compact.html
(then process compact.html with plucker)
It's the command-line-driven "webget" bit that I'm missing.
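Since Plucker itself is Python, a minimal "webget" can be done with just the standard library's urllib. This is only a sketch of the idea above - the script name and function name are mine, not part of Plucker:

```python
# webget.py -- a stand-in for the hypothetical "webget" command above.
# Standard library only; nothing here is Plucker-specific.
import sys
import urllib.request

def webget(url):
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url) as resp:
        # Use the charset the server declares, if any; fall back to latin-1.
        charset = resp.headers.get_content_charset() or "latin-1"
        return resp.read().decode(charset, errors="replace")

if __name__ == "__main__":
    sys.stdout.write(webget(sys.argv[1]))
```

That gives exactly the pipeline shape sketched above: `python webget.py http://www.foo.com/bar.html > xyzzy.html`, then run your "trimcrap" filter over xyzzy.html before handing it to the Plucker spider.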
Barring that, is there a script-drivable telnet-like beast out there that
is small and fast?
tport 80 www.foo.com "GET /bar.html" > xyzzy.html
("tport" would connect to the indicated port on the indicated machine,
transmit an arbitrary command, and return the result as raw text)
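Such a "tport" is a few lines of Python sockets. A rough sketch, again with the name and interface taken from the proposal above rather than any existing tool (note that for HTTP you'd send something like "GET /bar.html HTTP/1.0" - the trailing blank line that ends the request headers is supplied by the script):

```python
# tport.py -- sketch of the proposed "tport": connect to a port, send an
# arbitrary command line, and dump whatever comes back as raw text.
import socket
import sys

def tport(port, host, command):
    """Send `command` (plus CRLF) to host:port and return the raw reply."""
    with socket.create_connection((host, port), timeout=30) as sock:
        sock.sendall(command.encode("latin-1") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1", errors="replace")

if __name__ == "__main__":
    # usage: python tport.py 80 www.foo.com "GET /bar.html HTTP/1.0"
    sys.stdout.write(tport(int(sys.argv[1]), sys.argv[2], sys.argv[3]))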
If I haven't made myself clear, or if there is a better way, please mail
me. I know this can probably be done in python, but I had a peek at the
supplied .py files and I have to admit I *ahem* don't quite follow them.
--
/PJ, Peter Jaspers-Fayer, ITS [EMAIL PROTECTED] (519) 824-4120 x4777
LTC Rm. 1601/ Ontario Veterinary College, UofG/ Guelph, On./ N1G 2W1 Canada.