Re screen-scraping URLs,
The right way to do it is with XSLT, e.g.
http://cyber.com.au/~twb/.bin/fortune-snarf
The quick-and-dirty approach I would normally adopt is:
curl -sL example.net/page.html |
grep -Eoi "[^'\"]+\.png" |
wget -i-
For relative URLs, wget's --base doesn't work for me, so I put
something like this before the wget:
sed 's,^,http://example.net/,'
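The extraction and the prefixing can be rolled into one helper. This is only a sketch: the name extract_png_urls is made up for illustration, and it assumes every relative link is relative to the site root, which will be wrong for some pages (and for protocol-relative //host/... links).

```shell
# Hypothetical helper: read HTML on stdin, print one PNG URL per line.
# $1 is the base URL to prepend to anything that isn't already absolute.
extract_png_urls() {
    base=$1
    # Grab every run of non-quote characters ending in a literal ".png".
    grep -Eoi "[^'\"]+\.png" |
    # Leave http(s):// URLs alone; prefix everything else with the base,
    # collapsing any leading slashes so we don't end up with "//".
    sed -E "/^https?:\/\//! s,^/*,${base%/}/,"
}
```

Usage would then be something like
curl -sL example.net/page.html | extract_png_urls http://example.net/ | wget -i-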
If the source is split over multiple pages, start the pipeline with
map curl -fsL -- 'example.net/?page='{0..999} |
where map is http://cyber.com.au/~twb/.bin/map -- assuming that "bad"
pages return an HTTP 4xx (the -f makes that propagate upwards).
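If you don't have map to hand, a plain loop gets you roughly the same thing. This is a sketch, not a description of how map itself behaves: it assumes you want to stop at the first page that fails, and fetch / fetch_pages are hypothetical names.

```shell
# Retrieve one URL to stdout. With -f, curl exits non-zero on HTTP >= 400
# instead of printing the error page.
fetch() { curl -fsL -- "$1"; }

# fetch_pages PREFIX [MAX]: print PREFIX0, PREFIX1, ... up to PREFIXMAX,
# stopping at the first page that fails to download.
fetch_pages() {
    prefix=$1
    max=${2:-999}
    n=0
    while [ "$n" -le "$max" ]; do
        fetch "$prefix$n" || break
        n=$((n + 1))
    done
}
```

e.g. fetch_pages 'example.net/?page=' | ... and then the grep and wget stages as before.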
You will also often have to spoof the User-Agent (wget -U, curl -A)
and/or set wget --referer -- the latter usually only needs to match the
original domain, e.g. --referer=http://example.net/ will usually
suffice, rather than --referer=http://example.net/foo/bar/baz.html
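Since the referer usually only has to match the domain, it can be derived from the URL being fetched. A minimal sketch, with made-up function names and a placeholder User-Agent string (some sites will want a fuller, browser-like one):

```shell
# Hypothetical helper: reduce a URL to scheme://host/ for use as a referer.
origin_of() {
    printf '%s\n' "$1" | sed -E 's,^(https?://[^/]+).*,\1/,'
}

# Hypothetical wrapper: download $1 with a spoofed User-Agent and a
# same-domain referer. 'Mozilla/5.0' is just a placeholder value.
wget_spoofed() {
    wget -U 'Mozilla/5.0' --referer="$(origin_of "$1")" "$1"
}
```

So wget_spoofed http://example.net/foo/bar/baz.png sends --referer=http://example.net/ along with the request.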