Re: [Bug-wget] How do I tell wget not to follow links in a file?

2011-04-11 Thread David Skalinder
Okay, I have filed bug #33044 for this issue at
https://savannah.gnu.org/bugs/index.php?33044.  I've also moved the demo
to http://davidskalinder.com/wgettest/ and added a bunch of directories to
the unwanted link page to make the problem clearer.

It strikes me that this issue must come up frequently, especially for
sites with fairly flat directory hierarchies.  For example, any site
that keeps a recent-updates page containing a link to a previous-updates
page, both of which link to many root-level directories, would be
affected.  A user who wanted to maintain an up-to-date mirror of such a
site would have no option but to download the entire site every week.
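
For concreteness, here is a hedged sketch of that scenario (the host
name and file names are made up; the behavior is what current wget
does):

# hypothetical layout: /updates.html links to /archive.html, and both
# link to many root-level directories
wget -r -R archive.html http://example.com/updates.html
# wget still fetches archive.html in order to scan it for links (it is
# deleted afterwards because of -R), so every directory it links to
# gets queued and downloaded anyway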

HTH

DS



Re: [Bug-wget] How do I tell wget not to follow links in a file?

2011-04-11 Thread David Skalinder
It just occurred to me that since wget will perform this task properly if
it gets the rule from robots.txt, maybe this issue could be worked around
by proxying or spoofing the remote site's robots.txt file locally?  That
is, I write

User-agent: *
Disallow: /wgettest/links2.html

into a file, save it in my home directory, and then somehow tell wget that
davidskalinder.com/robots.txt is actually located at
/home/user/robots.txt?

Does anybody know a convenient way of doing this?  Or is there an easier
workaround I'm overlooking?
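
One hedged possibility: run wget through a small local rewriting proxy
that answers requests for robots.txt with the edited copy and forwards
everything else to the real server unchanged.  The proxy itself is
hypothetical here (any response-rewriting proxy would do); only the
wget side is shown:

# wget honors the http_proxy environment variable, and it requests
# robots.txt through the proxy like any other URL, so the substituted
# rules would take effect
http_proxy=http://127.0.0.1:8080/ wget -r http://davidskalinder.com/wgettest/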




Re: [Bug-wget] How do I tell wget not to follow links in a file?

2011-04-08 Thread David Skalinder
 On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
 David Skalinder da...@skalinder.net writes:

 I want to mirror part of a website that contains two links pages, each
 of which contains links to many root-level directories and also to the
 other links page.  I want to download recursively all the links from
 one links page, but not from the other: that is, I want to tell wget to
 "download links1 and follow all of its links, but do not download or
 follow links from links2."

 I've put a demo of this problem up at http://fangjaw.com/wgettest --
 there is a diagram there that might state the problem more clearly.

 This functionality seems so basic that I assume I must be overlooking
 something.  Clearly wget has been designed to give users control over
 which files they download; but all I can find is that -X controls both
 saving and link-following at the directory level, while -R controls
 saving at the file level but still follows links from unsaved files.

 why doesn't -X work in the scenario you have described?  If all links
 from `links2' are under /B, you can exclude them using something like:

 That scenario seems rather unlikely, unless we're talking about
 autogenerated folder index files...

 This issue would be resolved if wget had a way to avoid its current
 behavior of always unconditionally downloading HTML files regardless of
 what rejection rules say. Then you can just reject that single file (and
 if need be, download it as part of a separate session).

 --
 Micah J. Cowan
 http://micah.cowan.name/


I think that's right.  As I mention on the demo page, links2 could easily
contain links to hundreds of different directories, in which case you're
out of luck.

As Micah notes, if -R did not download the files at all (or even just
downloaded them but did not queue their links), that would fix the
problem.  Also, if a user could alter the site's robots.txt file, I
think she could make wget act correctly by including something like

User-agent: *
Disallow: /wgettest/links2.html

But obviously, most wget users won't have access to the server side. 
Since (I assume) wget knows how to follow that robots instruction, it
seems like it should be able to follow a similar instruction from the
client side.
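
(For what it's worth, wget's robots handling is already switchable from
the client side, which suggests the machinery for a client-side rule is
mostly there.  A hedged illustration, using the documented robots
setting:)

# wget obeys the server's robots.txt by default; the same support can
# be turned off entirely from the command line
wget -r -e robots=off http://davidskalinder.com/wgettest/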

David




Re: [Bug-wget] How do I tell wget not to follow links in a file?

2011-04-07 Thread David Skalinder
Well...  Shall I file a bug report for this issue?  This seems to be core
functionality for a program like wget, and frankly I'm a little surprised
that such a fundamental bug would exist in such a mature utility.

So if I'm missing something, I'm happy to be corrected.  But otherwise I
guess I'll write it up over at
http://savannah.gnu.org/bugs/?group=wget...?

David


 Hello,

 I'm trying to use wget to do something that seems very simple, but I
 haven't been able to find a solution anywhere and I'm hoping someone here
 could point me in the right direction.

 I want to mirror part of a website that contains two links pages, each of
 which contains links to many root-level directories and also to the other
 links page.  I want to download recursively all the links from one links
 page, but not from the other: that is, I want to tell wget to "download
 links1 and follow all of its links, but do not download or follow links
 from links2."

 I've put a demo of this problem up at http://fangjaw.com/wgettest -- there
 is a diagram there that might state the problem more clearly.

 This functionality seems so basic that I assume I must be overlooking
 something.  Clearly wget has been designed to give users control over
 which files they download; but all I can find is that -X controls both
 saving and link-following at the directory level, while -R controls saving
 at the file level but still follows links from unsaved files.

 Is there an obvious solution I'm missing?  Or a manual section I don't
 have or something?

 Thanks in advance,

 Fang

 (PS: the wget I'm using is 1.12.)



Re: [Bug-wget] How do I tell wget not to follow links in a file?

2011-04-07 Thread Giuseppe Scrivano
David Skalinder da...@skalinder.net writes:

 I want to mirror part of a website that contains two links pages, each of
 which contains links to many root-level directories and also to the other
 links page.  I want to download recursively all the links from one links
 page, but not from the other: that is, I want to tell wget to "download
 links1 and follow all of its links, but do not download or follow links
 from links2."

 I've put a demo of this problem up at http://fangjaw.com/wgettest -- there
 is a diagram there that might state the problem more clearly.

 This functionality seems so basic that I assume I must be overlooking
 something.  Clearly wget has been designed to give users control over
 which files they download; but all I can find is that -X controls both
 saving and link-following at the directory level, while -R controls saving
 at the file level but still follows links from unsaved files.

Why doesn't -X work in the scenario you have described?  If all links
from `links2' are under /B, you can exclude them using something like:

wget -r -X /wgettest/B http://fangjaw.com/wgettest
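
(If links2 pointed into several directories, -X also accepts a
comma-separated list; the directory names here are illustrative:)

wget -r -X /wgettest/B,/wgettest/C,/wgettest/D http://fangjaw.com/wgettest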

Cheers,
Giuseppe



Re: [Bug-wget] How do I tell wget not to follow links in a file?

2011-04-07 Thread Micah Cowan
On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
 David Skalinder da...@skalinder.net writes:
 
 I want to mirror part of a website that contains two links pages, each of
 which contains links to many root-level directories and also to the other
 links page.  I want to download recursively all the links from one links
 page, but not from the other: that is, I want to tell wget to "download
 links1 and follow all of its links, but do not download or follow links
 from links2."

 I've put a demo of this problem up at http://fangjaw.com/wgettest -- there
 is a diagram there that might state the problem more clearly.

 This functionality seems so basic that I assume I must be overlooking
 something.  Clearly wget has been designed to give users control over
 which files they download; but all I can find is that -X controls both
 saving and link-following at the directory level, while -R controls saving
 at the file level but still follows links from unsaved files.
 
 why doesn't -X work in the scenario you have described?  If all links
 from `links2' are under /B, you can exclude them using something like:

That scenario seems rather unlikely, unless we're talking about
autogenerated folder index files...

This issue would be resolved if wget had a way to avoid its current
behavior of always unconditionally downloading HTML files regardless of
what rejection rules say. Then you can just reject that single file (and
if need be, download it as part of a separate session).
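
(A sketch of that two-session workflow under the proposed behavior --
note the first command assumes a changed -R that suppresses
link-following, which current wget does not do:)

# hypothetical: if -R also stopped wget from following links out of
# rejected HTML files, this would skip links2.html and its whole tree
wget -r -R links2.html http://fangjaw.com/wgettest/
# the rejected page could then be fetched by itself, non-recursively
wget http://fangjaw.com/wgettest/links2.html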

-- 
Micah J. Cowan
http://micah.cowan.name/