Greets! Does anyone have any ideas on how to deal with a website whose frameset tag points to another host?
This is basically what I want to achieve: if a page of a website uses a frameset tag pointing to a different host (which is ignored under normal operation), I want wget to grab a single page from that address, store it locally as if it were part of the current site, and then continue normal operation (which will probably mean exiting).

The problem I have is that if a site uses a frameset for *all* its content, then basically nothing gets downloaded. I don't want to use --span-hosts since it might affect other crawl sessions.

I use a patched* wget (v1.10.2) as the guts of a web crawler. Since everyone uses Google, its search results are the baseline. I've found that this is basically what Google is doing: if the site has a frameset, then grab the content it points to and store that for the parent site.

I imagine this can only be achieved with a patch to the source. Since I don't have the time to dig back into wget's source to do this, I'm prepared to personally pay for this change. If anyone feels up to it (and has experience patching wget to add functionality), drop me an email.

Thanks,
Henry

---
* my patches:
  --content-type=LIST           comma-separated list of accepted content-types.
  --content-type-exclude=LIST   comma-separated list of rejected content-types.
  --max-url-len=NUMBER          accept maximum NUMBER URL length.
  --max-files=NUMBER            maximum number of files to download.
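
To make the requested behaviour concrete, here is a rough sketch of the frame-detection step in Python rather than in wget's C source. It only illustrates the logic (find frame targets inside a frameset that live on a different host than the parent page); the names FrameSrcParser and off_host_frames are made up for this example and are not part of wget or of any existing patch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> nested in a <frameset>.

    Hypothetical helper for illustration only; not part of wget.
    """

    def __init__(self):
        super().__init__()
        self.frameset_depth = 0   # > 0 while inside a <frameset>
        self.frame_srcs = []      # raw src values, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "frameset":
            self.frameset_depth += 1
        elif tag == "frame" and self.frameset_depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.frame_srcs.append(src)

    def handle_endtag(self, tag):
        if tag == "frameset" and self.frameset_depth > 0:
            self.frameset_depth -= 1


def off_host_frames(page_url, html):
    """Return absolute frame URLs whose host differs from page_url's host.

    These are the URLs a normal single-host crawl would skip; the idea
    is to fetch each of them once, without recursing any further.
    """
    parser = FrameSrcParser()
    parser.feed(html)
    parser.close()

    page_host = urlsplit(page_url).hostname
    result = []
    for src in parser.frame_srcs:
        absolute = urljoin(page_url, src)  # resolve relative frame targets
        if urlsplit(absolute).hostname != page_host:
            result.append(absolute)
    return result
```

A crawler built around this would download each returned URL exactly once, file it under the parent site, and not follow any links from it, which avoids the blanket effect of --span-hosts.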
