There are some pages I really like, but they have far too much crap on
them (links to other sections, back-links to "home" page, headers,
sidebars, etc. etc. ad nauseam - all very pretty on my PC browser, but
I have to jump down 50-60% to get to the real stuff.)
http://dailynews.yahoo.com/headlines/human_interest/oddly_enough/
is a case in point. Because they insert the date in the path, use
of "stayonhost" and "staybelow=" doesn't help (say, can you wildcard
part of staybelow? - that'd help a lot! - looking at STAYBELOW
processing in spider.py does not encourage me to think so, but I
hope to be told I'm wrong.)
I would like to pre-download a bunch of pages, filter their content, and
then pass them to the plucker spider/parser/compressor. I would tailor
the filter for each one (each is different), but the first step eludes me.
Can anyone out there point me to a tool that I can call from a script
in MSWindows to download an html page? Something that would look like
webget http://www.foo.com/bar.html > xyzzy.html
trimcrap xyzzy.html > compact.html
(then process compact.html with plucker)
It's the command-line-driven "webget" bit that I'm missing.
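Since Plucker itself is Python, a minimal "webget" can be done with just the standard library's urllib. This is only a sketch of the idea above - the script name and function name are mine, not part of Plucker:

```python
# webget.py -- a stand-in for the hypothetical "webget" command above.
# Standard library only; nothing here is Plucker-specific.
import sys
import urllib.request

def webget(url):
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url) as resp:
        # Use the charset the server declares, if any; fall back to latin-1.
        charset = resp.headers.get_content_charset() or "latin-1"
        return resp.read().decode(charset, errors="replace")

if __name__ == "__main__":
    sys.stdout.write(webget(sys.argv[1]))
```

That gives exactly the pipeline shape sketched above: `python webget.py http://www.foo.com/bar.html > xyzzy.html`, then run your "trimcrap" filter over xyzzy.html before handing it to the Plucker spider.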
Barring that, is there a script-drivable telnet-like beast out there that
is small and fast?
tport 80 www.foo.com "GET /bar.html" > xyzzy.html
("tport" would connect to the indicated port on the indicated machine,
transmit an arbitrary command, and return the result as raw text)
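Such a "tport" is a few lines of Python sockets. A rough sketch, again with the name and interface taken from the proposal above rather than any existing tool (note that for HTTP you'd send something like "GET /bar.html HTTP/1.0" - the trailing blank line that ends the request headers is supplied by the script):

```python
# tport.py -- sketch of the proposed "tport": connect to a port, send an
# arbitrary command line, and dump whatever comes back as raw text.
import socket
import sys

def tport(port, host, command):
    """Send `command` (plus CRLF) to host:port and return the raw reply."""
    with socket.create_connection((host, port), timeout=30) as sock:
        sock.sendall(command.encode("latin-1") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1", errors="replace")

if __name__ == "__main__":
    # usage: python tport.py 80 www.foo.com "GET /bar.html HTTP/1.0"
    sys.stdout.write(tport(int(sys.argv[1]), sys.argv[2], sys.argv[3]))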
If I haven't made myself clear, or if there is a better way, please mail
me. I know this can probably be done in python, but I had a peek at the
supplied .py files and I have to admit I *ahem* don't quite follow them.
--
/PJ, Peter Jaspers-Fayer, ITS [EMAIL PROTECTED] (519) 824-4120 x4777
LTC Rm. 1601/ Ontario Veterinary College, UofG/ Guelph, On./ N1G 2W1 Canada.