Re screen-scraping URLs,

The right way to do it is with XSLT, e.g.

    http://cyber.com.au/~twb/.bin/fortune-snarf

The quick-and-dirty approach I would normally adopt is:

    curl -sL example.net/page.html |
    grep -Eoi "[^'\"]+\.png" |
    wget -i-
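
To see what the grep stage does without touching the network, here is
a small sketch run against a made-up HTML line.  The sample markup and
the helper name extract_pngs are assumptions for illustration only.
The pattern is quoted so the shell leaves it alone, and the dot is
escaped so it only matches a literal ".png".

```shell
# Hypothetical sample page, standing in for the curl output.
html='<img src="http://example.net/a.png"> <a href='\''/img/B.PNG'\''>b</a> <img src="c.jpg">'

# extract_pngs is an illustrative helper name, not an existing tool.
# -E: extended regex, -o: print only the matched text, -i: ignore case.
extract_pngs() {
    grep -Eoi "[^'\"]+\.png"
}

# Prints one URL per line, ready to feed to wget -i-.
printf '%s\n' "$html" | extract_pngs
```

This relies on the URLs being quoted in the HTML; an unquoted
src=foo.png would drag the preceding text into the match.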

For relative URLs, wget's --base doesn't work for me, so I put
something like this before the wget:

    sed 's,^,http://example.net/,'
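
If the page mixes absolute and relative URLs, a slightly more careful
variant might look like the sketch below.  The helper name absolutise
and the base value are placeholders, and it assumes every relative URL
is relative to the site root rather than to the page's own directory.

```shell
# base is a placeholder for the site root; it must not contain a comma,
# because the comma is used as the sed delimiter below.
base='http://example.net/'

# absolutise is an illustrative helper name, not an existing tool.
absolutise() {
    # Leave lines that already start with http:// or https:// alone;
    # for the rest, drop one leading slash (if any) and prepend the base.
    sed -E '/^https?:\/\//!s,^/?,'"$base"','
}

printf '%s\n' 'http://example.net/a.png' '/img/b.png' 'c.png' | absolutise
```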

If the source is split over multiple pages,

    map curl -fsL -- 'example.net/?page='{0..999} |

where map is http://cyber.com.au/~twb/.bin/map -- assuming that "bad"
pages return an HTTP 4xx (the -f makes curl exit non-zero on an HTTP
error, so the failure propagates upwards).
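
Without the map script to hand, a plain loop that stops at the first
failing page is one way to get a similar effect.  This is a swapped-in
technique, not a description of what map does.  The function name
fetch_pages and the FETCH override are assumptions for illustration;
FETCH exists only so the loop can be tried offline with a stand-in
command.

```shell
# fetch_pages BASE MAX: fetch BASE0, BASE1, ... up to BASE<MAX>,
# writing each page to stdout and stopping at the first failure.
# FETCH is a hypothetical override; it defaults to curl -fsL --.
fetch_pages() {
    base=$1
    max=$2
    page=0
    while [ "$page" -le "$max" ]; do
        # curl -f exits non-zero on an HTTP error (status >= 400),
        # which ends the loop at the first "bad" page.
        ${FETCH:-curl -fsL --} "${base}${page}" || break
        page=$((page + 1))
    done
}

# Example use (needs network access, so it is left commented out):
# fetch_pages 'example.net/?page=' 999 | grep -Eoi "[^'\"]+\.png"
```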

You will also often have to spoof the User-Agent (wget -U, curl -A)
and/or set wget --referer -- the latter usually only needs to match the
original domain, e.g. --referer=http://example.net/ will usually
suffice, rather than --referer=http://example.net/foo/bar/baz.html
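
As a rough sketch of the referer trick: the helper below (site_root, a
made-up name) cuts a URL down to scheme and host, and the wget command
is only echoed rather than run.  The User-Agent string and the URL are
placeholders.

```shell
# site_root is an illustrative helper: keep the scheme and host of a
# URL and drop the path, giving a value suitable for --referer.
site_root() {
    printf '%s\n' "$1" | sed -E 's,^(https?://[^/]+).*,\1/,'
}

url='http://example.net/foo/bar/baz.png'
ua='Mozilla/5.0'   # placeholder User-Agent string

# Print the command instead of running it, so this is safe offline.
echo wget --user-agent="$ua" --referer="$(site_root "$url")" "$url"
```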

_______________________________________________
luv-main mailing list
[email protected]
http://lists.luv.asn.au/listinfo/luv-main
