Re: [Bug-wget] How do I tell wget not to follow links in a file?
Okay, I have filed bug #33044 for this issue at
https://savannah.gnu.org/bugs/index.php?33044. I've also moved the demo to
http://davidskalinder.com/wgettest/ and added a bunch of directories to the
unwanted link page to make the problem clearer.

It strikes me that this issue must come up fairly frequently, especially for
sites with fairly flat directory hierarchies. For example, any site which
keeps a recent-updates page that includes a link to a previous-updates page,
both of which contain links to many root-level directories, would be
affected. A user who wanted to maintain an up-to-date mirror of such a site
would have no option but to download the entire site every week.

HTH

DS

>> On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
>>> David Skalinder <da...@skalinder.net> writes:
>>>
>>>> I want to mirror part of a website that contains two links pages, each
>>>> of which contains links to many root-level directories and also to the
>>>> other links page. I want to download recursively all the links from
>>>> one links page, but not from the other: that is, I want to tell wget
>>>> "download links1 and follow all of its links, but do not download or
>>>> follow links from links2."
>>>>
>>>> I've put a demo of this problem up at http://fangjaw.com/wgettest --
>>>> there is a diagram there that might state the problem more clearly.
>>>>
>>>> This functionality seems so basic that I assume I must be overlooking
>>>> something. Clearly wget has been designed to give users control over
>>>> which files they download; but all I can find is that -X controls both
>>>> saving and link-following at the directory level, while -R controls
>>>> saving at the file level but still follows links from unsaved files.
>>>
>>> why doesn't -X work in the scenario you have described? If all links
>>> from `links2' are under /B, you can exclude them using something like:
>>>
>>> wget -r -Xwgettest/B http://fangjaw.com/wgettest
>>
>> That scenario seems rather unlikely, unless we're talking about
>> autogenerated folder index files...
>>
>> This issue would be resolved if wget had a way to avoid its current
>> behavior of always unconditionally downloading HTML files regardless of
>> what rejection rules say. Then you can just reject that single file (and
>> if need be, download it as part of a separate session).
>>
>> --
>> Micah J. Cowan
>> http://micah.cowan.name/
>
> I think that's right. As I mention on the demo page, links2 could easily
> contain links to hundreds of different directories, in which case you're
> out of luck. As Micah notes, if -R did not download the files at all (or
> even just downloaded them but did not queue their links), that should fix
> the problem.
>
> Also, if a user could alter the robots.txt file, I think she could make
> wget act correctly by including something like:
>
> User-agent: *
> Disallow: /wgettest/links2.html
>
> But obviously, most wget users won't have access to the server side.
> Since (I assume) wget knows how to follow that robots instruction, it
> seems like it should be able to follow a similar instruction from the
> client side.
>
> David
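To make the weekly-mirror scenario concrete, here is a rough sketch of the
job such a user would run, using the demo URL as a stand-in for a real
flat-hierarchy site; since wget offers no way to skip parsing the unwanted
links page, every directory it links to is crawled again on each run:

    # Hypothetical weekly mirror job (--mirror is shorthand for
    # -r -N -l inf --no-remove-listing). Nothing here stops wget from
    # parsing links2.html, so every root-level directory it links to
    # gets pulled back into the crawl on every run.
    wget --mirror http://davidskalinder.com/wgettest/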
Re: [Bug-wget] How do I tell wget not to follow links in a file?
It just occurred to me that since wget will perform this task properly if
it gets the rule from robots.txt, maybe this issue could be worked around
by proxying or spoofing the remote site's robots.txt file locally? That is,
I write:

User-agent: *
Disallow: /wgettest/links2.html

into a file, save it in my home directory, and then somehow tell wget that
davidskalinder.com/robots.txt is actually located at /home/user/robots.txt.
Does anybody know a convenient way of doing this? Or is there an easier
workaround I'm overlooking?
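One plausible way to do that kind of spoofing is an intercepting proxy that
serves a local file in place of the remote robots.txt. The sketch below
assumes mitmproxy's map_local feature and wget's wgetrc proxy settings; the
map-local spec syntax in particular is an assumption to check against the
mitmproxy documentation:

    # Assumption: mitmproxy's map_local option, which serves a local file
    # whenever the URL matches the given pattern (spec format per its
    # docs; verify before relying on it). Listens on port 8080 by default.
    mitmdump --map-local "|davidskalinder.com/robots.txt|/home/user/robots.txt"

    # Point wget at the proxy; it will then read the spoofed robots.txt
    # before recursing and skip the disallowed page.
    wget -r -e use_proxy=on -e http_proxy=http://127.0.0.1:8080/ \
         http://davidskalinder.com/wgettest/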
Re: [Bug-wget] How do I tell wget not to follow links in a file?
> On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
>> David Skalinder <da...@skalinder.net> writes:
>>
>>> I want to mirror part of a website that contains two links pages, each
>>> of which contains links to many root-level directories and also to the
>>> other links page. I want to download recursively all the links from one
>>> links page, but not from the other: that is, I want to tell wget
>>> "download links1 and follow all of its links, but do not download or
>>> follow links from links2."
>>>
>>> I've put a demo of this problem up at http://fangjaw.com/wgettest --
>>> there is a diagram there that might state the problem more clearly.
>>>
>>> This functionality seems so basic that I assume I must be overlooking
>>> something. Clearly wget has been designed to give users control over
>>> which files they download; but all I can find is that -X controls both
>>> saving and link-following at the directory level, while -R controls
>>> saving at the file level but still follows links from unsaved files.
>>
>> why doesn't -X work in the scenario you have described? If all links
>> from `links2' are under /B, you can exclude them using something like:
>>
>> wget -r -Xwgettest/B http://fangjaw.com/wgettest
>
> That scenario seems rather unlikely, unless we're talking about
> autogenerated folder index files...
>
> This issue would be resolved if wget had a way to avoid its current
> behavior of always unconditionally downloading HTML files regardless of
> what rejection rules say. Then you can just reject that single file (and
> if need be, download it as part of a separate session).
>
> --
> Micah J. Cowan
> http://micah.cowan.name/

I think that's right. As I mention on the demo page, links2 could easily
contain links to hundreds of different directories, in which case you're
out of luck. As Micah notes, if -R did not download the files at all (or
even just downloaded them but did not queue their links), that should fix
the problem.

Also, if a user could alter the robots.txt file, I think she could make
wget act correctly by including something like:

User-agent: *
Disallow: /wgettest/links2.html

But obviously, most wget users won't have access to the server side. Since
(I assume) wget knows how to follow that robots instruction, it seems like
it should be able to follow a similar instruction from the client side.

David
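It is worth spelling out why the server-side version works: wget honors the
remote /robots.txt by default during recursive retrieval. A minimal sketch
against the demo site, assuming the rule above were in place on the server:

    # With "Disallow: /wgettest/links2.html" in the site's robots.txt,
    # a plain recursive fetch skips links2.html entirely: wget fetches
    # /robots.txt first and never downloads or parses the disallowed page.
    wget -r http://davidskalinder.com/wgettest/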
Re: [Bug-wget] How do I tell wget not to follow links in a file?
Well... Shall I file a bug report for this issue? This seems to be core
functionality for a program like wget, and frankly I'm a little surprised
that such a fundamental bug would exist in such a mature utility. So if I'm
missing something, I'm happy to be corrected. But otherwise I guess I'll
write it up over at http://savannah.gnu.org/bugs/?group=wget...?

David

> Hello,
>
> I'm trying to use wget to do something that seems very simple, but I
> haven't been able to find a solution anywhere and I'm hoping someone here
> could point me in the right direction.
>
> I want to mirror part of a website that contains two links pages, each of
> which contains links to many root-level directories and also to the other
> links page. I want to download recursively all the links from one links
> page, but not from the other: that is, I want to tell wget "download
> links1 and follow all of its links, but do not download or follow links
> from links2."
>
> I've put a demo of this problem up at http://fangjaw.com/wgettest --
> there is a diagram there that might state the problem more clearly.
>
> This functionality seems so basic that I assume I must be overlooking
> something. Clearly wget has been designed to give users control over
> which files they download; but all I can find is that -X controls both
> saving and link-following at the directory level, while -R controls
> saving at the file level but still follows links from unsaved files.
>
> Is there an obvious solution I'm missing? Or a manual section I don't
> have or something?
>
> Thanks in advance,
> Fang
>
> (PS: The wget I'm using is 1.12.)
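For reference, the obvious attempt with -R looks like the sketch below; it
fails for exactly the reason described above, because wget 1.12 downloads
rejected HTML files anyway in order to scan them for links, queues
everything it finds, and only then deletes the file:

    # Naive attempt: reject links2.html at the file level. The file is
    # not kept on disk, but its links are still followed, so the unwanted
    # directories get downloaded regardless.
    wget -r -R links2.html http://fangjaw.com/wgettest/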
Re: [Bug-wget] How do I tell wget not to follow links in a file?
David Skalinder <da...@skalinder.net> writes:

> I want to mirror part of a website that contains two links pages, each of
> which contains links to many root-level directories and also to the other
> links page. I want to download recursively all the links from one links
> page, but not from the other: that is, I want to tell wget "download
> links1 and follow all of its links, but do not download or follow links
> from links2."
>
> I've put a demo of this problem up at http://fangjaw.com/wgettest --
> there is a diagram there that might state the problem more clearly.
>
> This functionality seems so basic that I assume I must be overlooking
> something. Clearly wget has been designed to give users control over
> which files they download; but all I can find is that -X controls both
> saving and link-following at the directory level, while -R controls
> saving at the file level but still follows links from unsaved files.

why doesn't -X work in the scenario you have described? If all links from
`links2' are under /B, you can exclude them using something like:

wget -r -Xwgettest/B http://fangjaw.com/wgettest

Cheers,
Giuseppe
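For what it's worth, -X accepts a comma-separated list of directory
patterns, so a handful of known directories can be excluded in one command;
the extra directory names below are made up for illustration. This just
cannot scale to a links2 that points at hundreds of them:

    # Excluding several known directories at once
    # (-X / --exclude-directories takes a comma-separated list):
    wget -r -X wgettest/B,wgettest/C,wgettest/D http://fangjaw.com/wgettest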
Re: [Bug-wget] How do I tell wget not to follow links in a file?
On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
> David Skalinder <da...@skalinder.net> writes:
>
>> I want to mirror part of a website that contains two links pages, each
>> of which contains links to many root-level directories and also to the
>> other links page. I want to download recursively all the links from one
>> links page, but not from the other: that is, I want to tell wget
>> "download links1 and follow all of its links, but do not download or
>> follow links from links2."
>>
>> I've put a demo of this problem up at http://fangjaw.com/wgettest --
>> there is a diagram there that might state the problem more clearly.
>>
>> This functionality seems so basic that I assume I must be overlooking
>> something. Clearly wget has been designed to give users control over
>> which files they download; but all I can find is that -X controls both
>> saving and link-following at the directory level, while -R controls
>> saving at the file level but still follows links from unsaved files.
>
> why doesn't -X work in the scenario you have described? If all links
> from `links2' are under /B, you can exclude them using something like:
>
> wget -r -Xwgettest/B http://fangjaw.com/wgettest

That scenario seems rather unlikely, unless we're talking about
autogenerated folder index files...

This issue would be resolved if wget had a way to avoid its current
behavior of always unconditionally downloading HTML files regardless of
what rejection rules say. Then you can just reject that single file (and
if need be, download it as part of a separate session).

--
Micah J. Cowan
http://micah.cowan.name/
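The separate-session workflow Micah describes would look something like the
sketch below, with the crucial caveat that the first step only behaves as
intended under the proposed change, i.e. if -R were honored before the file
is parsed for links:

    # Step 1: mirror everything except links2.html. Under the proposed
    # behavior, wget would neither save links2.html nor queue its links.
    wget -r -R links2.html http://fangjaw.com/wgettest/

    # Step 2: fetch the rejected page by itself in a separate session.
    wget http://fangjaw.com/wgettest/links2.html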