Re: wget re-download fully downloaded files
Maksim Ivanov wrote:
> I'm trying to download the same file from the same server. The command line I use:
>
> wget --debug -o log -c -t 0 --load-cookies=cookie_file http://rapidshare.com/files/153131390/Blind-Test.rar
>
> Attached below are two files: a log with 1.9.1 and a log with 1.10.2. Both logs were made when Blind-Test.rar was already on my HDD. Sorry for some mess in the logs; my console uses Russian.

Thanks very much for providing these, Maksim; they were very helpful. (Sorry for getting back to you so late: it's been busy lately.)

I've confirmed this behavioral difference (though I compared the current development sources against 1.8.2, rather than 1.10.2 against 1.9.1). Your logs involve a 302 redirection before arriving at the real file, but that's just a red herring. The difference is this: when 1.9.1 encountered a server that responded to a byte-range request with 200 (meaning it doesn't know how to send partial contents), but with a Content-Length value matching the size of the local file, wget would close the connection and not proceed to redownload. 1.10.2, on the other hand, would just re-download it. Actually, I'll have to confirm this, but I think that current Wget will re-download it, yet not overwrite the existing content until it arrives at bytes beyond what is already on disk.

I need to investigate further to see whether this change was somehow intentional (though I can't imagine what the reasoning would be); if I don't find a good reason not to, I'll revert this behavior. Probably for the 1.12 release, but I might possibly punt it to 1.13 on the grounds that it's not a recent regression (however, it should really be a quick fix, so most likely it'll be in for 1.12).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
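The resume decision described above can be sketched in a few lines. This is an illustrative model only, not wget's actual source: a byte-range request may be answered with 200 (full body) rather than 206, and the 1.9.1 behavior treated 200 plus a matching Content-Length as "nothing to do".

```python
def resume_action(status: int, content_length, local_size: int) -> str:
    """Model of the -c resume decision discussed above (not wget's code)."""
    if status == 206:
        return "append"             # server honored the byte range
    if status == 200:
        if content_length == local_size:
            return "nothing to do"  # file already complete (1.9.1 behavior)
        return "redownload"         # server ignored the range entirely
    return "error"

print(resume_action(200, 1048576, 1048576))  # → nothing to do
```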
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
Re: wget re-download fully downloaded files
Maksim Ivanov wrote:
> I'm trying to download the same file from the same server. The command line I use:
>
> wget --debug -o log -c -t 0 --load-cookies=cookie_file http://rapidshare.com/files/153131390/Blind-Test.rar
>
> Attached below are two files: a log with 1.9.1 and a log with 1.10.2. Both logs were made when Blind-Test.rar was already on my HDD. Sorry for some mess in the logs; my console uses Russian.

This is currently being tracked at https://savannah.gnu.org/bugs/?24662

A similar and related bug report is at https://savannah.gnu.org/bugs/?24642, in which the logs show that rapidshare.com also issues erroneous Content-Range information when it responds with 206 Partial Content; that exercised a different regression* introduced in 1.11.x.

* It's not really a regression, since it's desirable behavior: we now determine the size of the content from the Content-Range header, since Content-Length is often missing or erroneous for partial content. However, in this instance of server error, it resulted in less desirable behavior than the previous version of Wget. Anyway...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
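The footnote's point, deriving total size from Content-Range rather than Content-Length, can be illustrated with a small parser. The `bytes START-END/TOTAL` format is standard HTTP/1.1; the helper name here is my own:

```python
import re

def total_from_content_range(value):
    """Parse 'bytes START-END/TOTAL' and return TOTAL as an int,
    or None when the total is '*' (unknown) or the header is malformed."""
    m = re.fullmatch(r"bytes\s+(\d+)-(\d+)/(\d+|\*)", value.strip())
    if m is None or m.group(3) == "*":
        return None
    return int(m.group(3))

# A server sending an erroneous TOTAL here misleads any client that
# trusts Content-Range over Content-Length, as described above.
print(total_from_content_range("bytes 500-999/1234"))  # → 1234
```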
RE: wget re-download fully downloaded files
Micah Cowan wrote:
> Actually, I'll have to confirm this, but I think that current Wget will re-download it, but not overwrite the current content, until it arrives at some content corresponding to bytes beyond the current content. I need to investigate further to see if this change was somehow intentional (though I can't imagine what the reasoning would be); if I don't find a good reason not to, I'll revert this behavior.

One reason to keep the current behavior is to retain all of the existing content in the event of another partial download that is shorter than the previous one. However, I think that only makes sense if wget is comparing the new content with what is already on disk.

Tony
Re: wget re-download fully downloaded files
I'm trying to download the same file from the same server. The command line I use:

wget --debug -o log -c -t 0 --load-cookies=cookie_file http://rapidshare.com/files/153131390/Blind-Test.rar

Attached below are two files: a log with 1.9.1 and a log with 1.10.2. Both logs were made when Blind-Test.rar was already on my HDD. Sorry for some mess in the logs; my console uses Russian.

Yours faithfully,
Maksim Ivanov

2008/10/13 Micah Cowan [EMAIL PROTECTED]:
> Maksim Ivanov wrote:
>> Hello! Starting with version 1.10, wget has a very annoying bug: if you try to download an already fully downloaded file, wget downloads it all over again, but 1.9.1 says "Nothing to do", as it should.
>
> It all depends on what options you specify. That's as true for 1.9 as it is for 1.10 (or the current release, 1.11.4). It can also depend on the server; not all of them support timestamping or partial fetches. Please post the minimal log that exhibits the problem you're experiencing.
>
> --
> Thanks,
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> GNU Maintainer: wget, screen, teseq
> http://micah.cowan.name/

[Attachment: log.1.9.1 (binary data)]
[Attachment: log.1.10.2 (binary data)]
Re: wget re-download fully downloaded files
Maksim Ivanov wrote:
> Hello! Starting with version 1.10, wget has a very annoying bug: if you try to download an already fully downloaded file, wget downloads it all over again, but 1.9.1 says "Nothing to do", as it should.

It all depends on what options you specify. That's as true for 1.9 as it is for 1.10 (or the current release, 1.11.4). It can also depend on the server; not all of them support timestamping or partial fetches. Please post the minimal log that exhibits the problem you're experiencing.

--
Thanks,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
Re: Wget and Yahoo login?
And you'll probably have to do this again - I bet Yahoo expires the session cookies!

On Tue, Sep 9, 2008 at 2:18 PM, Donald Allen [EMAIL PROTECTED] wrote:
> After surprisingly little struggle, I got Plan B working -- logged into yahoo with wget, saved the cookies, including session cookies, and then proceeded to fetch pages using the saved cookies. Those pages came back logged in as me, with my customizations. Thanks to Tony, Daniel, and Micah -- you all provided critical advice in solving this problem.
>
> /Don

--
Best Regards. Please keep in touch. This is unedited. P-)
Re: Wget and Yahoo login?
On Mon, 8 Sep 2008, Donald Allen wrote:
> The page I get is what would be obtained if an un-logged-in user went to the specified url. Opening that same url in Firefox *does* correctly indicate that it is logged in as me and reflects my customizations.

First, LiveHTTPHeaders is the Firefox plugin that everyone who tries these stunts needs. Then you read the capture and replay the requests as closely as possible using your tool.

As you will find out, sites like this use all sorts of funny tricks to figure you out and to make it hard to automate what you're trying to do. They tend to use javascript for redirects and for fiddling with cookies, just to make sure you have a javascript- and cookie-enabled browser. So you need to work hard(er) when trying this with non-browsers.

It's certainly still possible, even without using the browser to get the first cookie file. But it may take some effort.

--
 / daniel.haxx.se
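The capture-and-replay approach Daniel describes looks roughly like this in practice. The header values below are placeholders standing in for whatever LiveHTTPHeaders actually shows, and the request is only built here, not sent:

```python
from urllib.request import Request

# Placeholder values standing in for a real LiveHTTPHeaders capture.
captured_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux i686) Firefox/2.0",
    "Accept-Encoding": "gzip",
    "Cookie": "Y=...; T=...",  # including the session cookies from the capture
}

def build_replay_request(url, headers):
    """Mimic the browser by replaying its captured request headers."""
    req = Request(url)
    for name, value in headers.items():
        req.add_header(name, value)
    return req

req = build_replay_request("http://example.com/", captured_headers)
# urllib.request.urlopen(req) would then send it with the browser's headers.
```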
Re: Wget and Yahoo login?
On Tue, Sep 9, 2008 at 3:14 AM, Daniel Stenberg [EMAIL PROTECTED] wrote:
> First, LiveHTTPHeaders is the Firefox plugin everyone who tries these stunts needs. Then you read the capture and replay the requests as closely as possible using your tool.
> [...]
> It's certainly still possible, even without using the browser to get the first cookie file. But it may take some effort.

I have not been able to retrieve a page with wget as if I were logged in, using --load-cookies and Micah's suggestion about 'Accept-Encoding' (there was a typo in his message -- it's 'Accept-Encoding', not 'Accept-Encodings'). I did install livehttpheaders, and tried --no-cookies and --header with the cookie info from livehttpheaders, and that did work.

Some of the cookie info sent by Firefox was a mystery, because it's not in the cookie file. Perhaps that's the crucial difference -- I'm speculating that wget isn't sending quite the same thing as Firefox when --load-cookies is used, because Firefox is adding stuff that isn't in the cookie file. Just a guess.

Is there a way to ask wget to print the headers it sends (a la livehttpheaders)? I've looked through the options on the man page and didn't see anything, though I might have missed it.
Re: Wget and Yahoo login?
Donald Allen wrote:
> I have not been able to retrieve a page with wget as if I were logged in, using --load-cookies and Micah's suggestion about 'Accept-Encoding'. I did install livehttpheaders, and tried --no-cookies and --header with the cookie info from livehttpheaders, and that did work.

That's how I did it as well (except I got the headers from tcpdump); I'm using Firefox 3, so I don't have access to FF's new sqlite-based cookies file (apart from the patch at http://wget.addictivecode.org/FrontPage?action=AttachFile&do=view&target=wget-firefox3-cookie.patch).

> Some of the cookie info sent by Firefox was a mystery, because it's not in the cookie file. Perhaps that's the crucial difference -- I'm speculating that wget isn't sending quite the same thing as Firefox when --load-cookies is used, because Firefox is adding stuff that isn't in the cookie file. Just a guess.

Probably there are session cookies involved, sent with the first page, that you're not sending back with the form submit. --keep-session-cookies and --save-cookies=foo.txt make a good combination.

> Is there a way to ask wget to print the headers it sends (a la livehttpheaders)?

--debug

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
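The session-cookie distinction at the heart of this thread can be demonstrated with Python's Netscape-format cookie jar, which follows the same convention as Firefox's cookies.txt and wget: cookies marked for discard (no expiry) are skipped on save unless explicitly kept, the analogue of --keep-session-cookies. A sketch:

```python
import http.cookiejar
import os
import tempfile

# Build a cookie with no expiry time: a session cookie, marked "discard".
session_cookie = http.cookiejar.Cookie(
    version=0, name="SESSION", value="abc123", port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True, secure=False,
    expires=None, discard=True,
    comment=None, comment_url=None, rest={})

jar = http.cookiejar.MozillaCookieJar()
jar.set_cookie(session_cookie)

tmp = tempfile.mkdtemp()
default_file = os.path.join(tmp, "cookies-default.txt")
kept_file = os.path.join(tmp, "cookies-kept.txt")

jar.save(default_file)                    # session cookie silently dropped
jar.save(kept_file, ignore_discard=True)  # kept, like --keep-session-cookies
```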
Re: Wget and Yahoo login?
On Tue, Sep 9, 2008 at 12:23 PM, Micah Cowan [EMAIL PROTECTED] wrote:
> Probably there are session cookies involved, sent with the first page, that you're not sending back with the form submit. --keep-session-cookies and --save-cookies=foo.txt make a good combination.
>
>> Is there a way to ask wget to print the headers it sends (a la livehttpheaders)?
>
> --debug

Well, I rebuilt my wget with the 'debug' use flag and ran it on the yahoo test page (after having logged in to yahoo with firefox, of course) with --load-cookies and the accept-encoding header item, with --debug. Very useful. wget is sending every cookie item in firefox's cookies.txt. But firefox sends three additional cookie items in the header that wget does not send. Those items are *not* in firefox's cookies.txt, so wget has no way of knowing about them. Is it possible that firefox is not writing session cookies to the file?

The result of this test, just to be clear, was a page indicating that yahoo thought I was not logged in. Those extra items firefox is sending appear to be the difference: when I included them (from the livehttpheaders output) in the cookies I sent manually with --header, I got back a page indicating that yahoo knew I was logged in, formatted with my preferences.

/Don
Re: Wget and Yahoo login?
Donald Allen wrote:
> The result of this test, just to be clear, was a page that indicated yahoo thought I was not logged in. Those extra items firefox is sending appear to be the difference, because when I included them (from the livehttpheaders output) in the cookies I sent manually with --header, I got the same page back with wget that indicated that yahoo knew I was logged in, formatted with my preferences.

Perhaps you missed this in my last message:

> Probably there are session cookies involved, that are sent in the first page, that you're not sending back with the form submit. --keep-session-cookies and --save-cookies=foo.txt make a good combination.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
Re: Wget and Yahoo login?
On Tue, Sep 9, 2008 at 1:29 PM, Micah Cowan [EMAIL PROTECTED] wrote:
> Perhaps you missed this in my last message:
>
>> Probably there are session cookies involved, that are sent in the first page, that you're not sending back with the form submit. --keep-session-cookies and --save-cookies=foo.txt make a good combination.

I think we're mis-communicating, easily my fault, since I know just enough about this stuff to be dangerous. I am doing the yahoo session login with firefox, not with wget, so I'm using the first and easier of your two suggested methods. I'm guessing you think I'm trying to log in to the yahoo session with wget, in which case --keep-session-cookies and --save-cookies=foo.txt would make perfect sense, but that's not what I'm doing (yet -- if I'm right about what's happening here, I'm going to have to resort to it). Using firefox to initiate the session, it looks to me like wget never gets to see the session cookies, because I don't think firefox writes them to its cookie file (which actually makes sense -- if they only need to live as long as the session, why write them out?).

/Don
Re: Wget and Yahoo login?
Donald Allen wrote:
> I am doing the yahoo session login with firefox, not with wget, so I'm using the first and easier of your two suggested methods. [...] Using firefox to initiate the session, it looks to me like wget never gets to see the session cookies, because I don't think firefox writes them to its cookie file.

Yes, and I understood this; the thing is, that if session cookies are involved (i.e., cookies that are marked for immediate expiration and are not meant to be saved to the cookies file), then I don't see how you have much choice other than to use the harder method, or else to fake the session cookies by manually inserting them into your cookies file or whatnot (not sure how well that may be expected to work). Or, yeah, add an explicit --header 'Cookie: ...'.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
Re: Wget and Yahoo login?
On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED] wrote:
> Yes, and I understood this; the thing is, that if session cookies are involved (i.e., cookies that are marked for immediate expiration and are not meant to be saved to the cookies file), then I don't see how you have much choice other than to use the harder method, or else to fake the session cookies by manually inserting them into your cookies file or whatnot (not sure how well that may be expected to work). Or, yeah, add an explicit --header 'Cookie: ...'.

Ah, the misunderstanding was that the stuff you thought I missed was intended to push me in the direction of Plan B -- logging in to yahoo with wget. I understand now. I'll look at trying to make this work. Thanks for all the help, though I can't guarantee that you are done yet :-) But, hopefully, this exchange will benefit others.

/Don
Re: Wget and Yahoo login?
Donald Allen wrote:
> Ah, the misunderstanding was that the stuff you thought I missed was intended to push me in the direction of Plan B -- logging in to yahoo with wget.

Yes; and that's entirely my fault, as I didn't explicitly say that.

> I understand now. I'll look at trying to make this work. Thanks for all the help, though I can't guarantee that you are done yet :-) But, hopefully, this exchange will benefit others.

I was actually surprised you kept going after I pointed out that it required the Accept-Encoding header that results in gzipped content. This behavior is a little surprising to me from Yahoo!. It's not surprising in _general_, but for a site that really wants to be as accessible as possible (I would think?), insisting on the latest browsers seems ill-advised. Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape, visit a site, and get a server-generated page that's empty other than the phrase "You're not using Internet Explorer". :p

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
Re: Wget and Yahoo login?
On Tue, Sep 9, 2008 at 1:51 PM, Micah Cowan [EMAIL PROTECTED] wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Donald Allen wrote: On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Donald Allen wrote: I am doing the yahoo session login with firefox, not with wget, so I'm using the first and easier of your two suggested methods. I'm guessing you are thinking that I'm trying to login to the yahoo session with wget, and thus --keep-session-cookies and --save-cookies=foo.txt would make perfect sense to me, but that's not what I'm doing (yet -- if I'm right about what's happening here, I'm going to have to resort to this). But using firefox to initiate the session, it looks to me like wget never gets to see the session cookies because I don't think firefox writes them to its cookie file (which actually makes sense -- if they only need to live as long as the session, why write them out?). Yes, and I understood this; the thing is, that if session cookies are involved (i.e., cookies that are marked for immediate expiration and are not meant to be saved to the cookies file), then I don't see how you have much choice other than to use the harder method, or else to fake the session cookies by manually inserting them to your cookies file or whatnot (not sure how well that may be expected to work). Or, yeah, add an explicit --header 'Cookie: ...'. Ah, the misunderstanding was that the stuff you thought I missed was intended to push me in the direction of Plan B -- log in to yahoo with wget. Yes; and that's entirely my fault, as I didn't explicitly say that. No problem. I understand now. I'll look at trying to make this work. Thanks for all the help, though I can't guarantee that you are done yet :-) But, hopefully, this exchange will benefit others. I was actually surprised you kept going after I pointed out that it required the Accept-Encoding header that results in gzipped content. 
That didn't faze me, because the pages I'm after will be processed by a python program, so having to gunzip would not require a manual step. This behavior is a little surprising to me from Yahoo!.

It's not surprising in _general_, but for a site that really wants to be as accessible as possible (I would think?), insisting on the latest browsers seems ill-advised. Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape, visit a site, and get a server-generated page that's empty other than the phrase "You're not using Internet Explorer". :p

And taking it one step further, I'm greatly enjoying watching Microsoft thrash around, trying to save themselves, which I don't think they will. Perhaps they'll re-invent themselves, as IBM did, but their cash cow is not going to produce milk too much longer. I've just installed the Chrome beta on the Windows side of one of my machines (I grudgingly give it 10 Gb on each machine; Linux gets the rest), and it looks very, very nice. They've still got work to do, but they appear to be heading in a very good direction. These are smart people at Google. All signs seem to be pointing towards more and more computing happening on the server side in the coming years.

/Don
Re: Wget and Yahoo login?
After surprisingly little struggle, I got Plan B working -- logged into yahoo with wget, saved the cookies, including session cookies, and then proceeded to fetch pages using the saved cookies. Those pages came back logged in as me, with my customizations. Thanks to Tony, Daniel, and Micah -- you all provided critical advice in solving this problem.

/Don
Re: Wget and Yahoo login?
2008/9/8 Tony Godshall [EMAIL PROTECTED]:

I haven't done this but I can speculate that you need to have wget identify itself as firefox.

When I read this, I thought it looked promising, but it doesn't work. I tried sending exactly the user-agent string firefox is sending, and still got a page from yahoo that clearly indicates yahoo thinks I'm not logged in.

/Don

Quote from man wget...

-U agent-string
--user-agent=agent-string
    Identify as agent-string to the HTTP server. The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as Wget/version, version being the current version number of Wget. However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While this is not such a bad idea in theory, it has been abused by servers denying information to clients other than (historically) Netscape or, more frequently, Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.

On Mon, Sep 8, 2008 at 12:25 PM, Donald Allen [EMAIL PROTECTED] wrote:

There was a recent discussion concerning using wget to obtain pages from yahoo, logged into yahoo as a particular user. Micah replied to Rick Nakroshis with instructions describing two methods for doing this. This information has also been added by Micah to the wiki. I just tried the simpler of the two methods -- logging into yahoo with my browser (Firefox 2.0.0.16) and then downloading a page with

wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies <my home directory>/.mozilla/firefox/id2dmo7r.default/cookies.txt 'http://<yahoo url>'

The page I get is what would be obtained if an un-logged-in user went to the specified url.
Opening that same url in Firefox *does* correctly indicate that it is logged in as me and reflects my customizations. wget -V: GNU Wget 1.11.1 I am running a reasonably up-to-date Gentoo system (updated within the last month) on a Thinkpad X61. Have I missed something here? Any help will be appreciated. Please include my personal address in your replies as I am not (yet) a subscriber to this list. Thanks -- /Don Allen -- Best Regards. Please keep in touch. This is unedited. P-)
Re: Wget and Yahoo login?
Donald Allen wrote:

I just tried the simpler of the two methods -- logging into yahoo with my browser (Firefox 2.0.0.16) and then downloading a page with wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies <my home directory>/.mozilla/firefox/id2dmo7r.default/cookies.txt 'http://<yahoo url>'. The page I get is what would be obtained if an un-logged-in user went to the specified url. Opening that same url in Firefox *does* correctly indicate that it is logged in as me and reflects my customizations.

Are you signing into the main Yahoo! site? When I try to do so, whether I use the cookies or not, I get a message about updating your browser to something more modern, or the like. The difference appears to be a combination of _both_ the User-Agent (as you've done) _and_ --header 'Accept-Encoding: gzip,deflate'. This, plus appropriate cookies, gets me a decent logged-in page, but of course it's gzip-compressed. Since Wget doesn't currently support gzip decoding and the like, that makes the use of Wget in this situation cumbersome. Support for something like this probably won't be seen until 1.13 or 1.14, I'm afraid.

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
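A sketch of the combination described above, as a single command line (the profile path, User-Agent string, and URL are placeholders, and this is untested against the real site); it is wrapped in a function, with the gzipped body piped through gunzip since wget doesn't decode it:

```shell
# Sketch only -- placeholders throughout (profile dir, UA string, URL).
# Browser cookies + a browser User-Agent + Accept-Encoding, then gunzip.
yahoo_fetch() {
  wget -q -O - \
    --load-cookies "$HOME/.mozilla/firefox/PROFILE.default/cookies.txt" \
    -U 'Mozilla/5.0 (X11; U; Linux i686; rv:1.8.1.16) Gecko/20080716 Firefox/2.0.0.16' \
    --header 'Accept-Encoding: gzip,deflate' \
    'http://my.yahoo.com/' | gunzip
}
# When online:  yahoo_fetch > yahoo.html
```
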
Re: [wget-notify] add a new option
houda hocine wrote:
Hi,

Hi houda. This message was sent to wget-notify, which was not the proper forum. Wget-notify is reserved for bug-change and (previously) commit notifications, and is not intended for discussion (though I obviously haven't blocked discussions; the original intent was to be able to discuss commits, but I'm not sure I need to allow discussions any more, so it may be disallowed soon). The appropriate list would be wget@sunsite.dk, to which this discussion has been redirected.

We created a new format for archiving (.warc), and we want to ensure that wget generates this format directly from the input url. Can you help me with some ideas for achieving this new option? The format is (warc -wget url). I am in the process of trying to understand the source code to add this new option. Which .c file allows me to do this?

Doing this is not likely to be a trivial undertaking: the current file-output interface isn't really abstracted enough to allow this, so basically you'll need to modify most of the existing .c files. We are hoping at some future point to allow for a more generic output format, for direct output to (for instance) tarballs and .mhtml archives. At that point, it'd probably be fairly easy to write extensions to do what you want. In the meantime, though, it'll be a pain in the butt. I can't really offer much help; the best way to understand the source is to read and explore it. However, on the general topic of adding new options to Wget, Tony Lewis has written an excellent guide at http://wget.addictivecode.org/OptionsHowto. Hope that helps!

Please note that I won't likely be entertaining patches to Wget to make it output to non-mainstream archive formats, and even once generic output mechanisms are supported, the mainstream archive formats will most likely be supported as extension plugins or similar, and not as built-in support within Wget.

- -- Micah J.
Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
Re: Wget function
Hello! First of all, I'd like to thank you for your great tool. I have a request: I use this command to save a url with absolute links, and it works very well: wget -k http://www.google.fr/ But I want to save the file under a name other than index.html, for example google-is-good.html. I tried this: wget -k --output-document=google-is-good.html http://www.google.fr/ It works, except that I lose the absolute links, which is terrible. I don't know how to fix this problem. Which combination do I have to use for wget -k with another name? Can you help me? I can't find the solution. Also, where can I find the latest version for Windows? Thank you for your time. Regards, carlos.
Re: Wget function
karlito wrote:

Hello! First of all, I'd like to thank you for your great tool. I use this command to save a url with absolute links, and it works very well: wget -k http://www.google.fr/ But I want to save the file under a name other than index.html, for example google-is-good.html. I tried this: wget -k --output-document=google-is-good.html http://www.google.fr/ It works, except that I lose the absolute links, which is terrible.

Yeah. Conversions won't work with --output-document, which behaves rather like a shell redirection.

I don't know how to fix this problem. Which combination do I have to use for wget -k with another name?

You could always rename it afterwards. In your specific case, the current development sources (which will become Wget 1.12) have a --default-page=google-is-good.html option for specifying the default page name, thanks to Joao Ferreira. It's not yet available in any release.

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
Re: WGET :: [Correction de texte]
Tom wrote:

Téléchargement récursif:
  -r, --recursive       spécifer un téléchargement récursif.
  -l, --level=NOMBRE    _*profondeeur*_ maximale de récursion (inf ou 0 pour infini).

Just one 'e' to remove from 'profondeeur', and it will be fixed!

This issue appears to have been fixed with the latest French translation. It will be released with Wget 1.12.

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
Re: Wget function
Please keep the list in the replies.

karlito wrote:

Hi, thank you for the reply. Can my problem be fixed in the next version? Because this is for a batch; I have more than 1000 urls to process, and that is why I need to find a solution. Also, when you say rename, what is the function to rename with wget?

I mean, just use the mv or rename command on your operating system.

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
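A sketch of the rename-afterwards approach for a large batch (the urls.txt file and the host-based naming scheme are hypothetical, not anything wget provides): fetch each page with -k into its own directory, then mv the converted index.html into place.

```shell
#!/bin/sh
# Hypothetical batch rename: urls.txt holds one URL per line (created empty
# here as a stand-in; in real use it already holds your 1000 URLs).
: > urls.txt
while read -r url; do
  # Derive a name from the URL's host part (scheme and path stripped).
  name=$(printf '%s\n' "$url" | sed 's|^[a-z]*://||; s|/.*||')
  # -k converts links; mv does the rename that --output-document can't.
  wget -k -P "$name.d" "$url" && mv "$name.d/index.html" "$name.html"
done < urls.txt
```
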
Re: wget and wiki crawling
asm c wrote:

I've recently been using wget, and got it working for the most part, but there's one issue that's really been bugging me. One of the parameters I use is '-R *action=*,*oldid=*' (side note on the platform: ZSH on NetBSD on the SDF public access unix system, although I've also used it on windows with the same result). The purpose of this parameter is that, when wget crawls a mid-sized wiki I'd like to have a local copy of, it doesn't bother with all the history pages, edit pages, and so forth. Not downloading these would save me an enormous amount of time. Unfortunately, the parameter is ignored until after the php page is downloaded. So, because it waits until a page has been downloaded to delete it, using the parameter doesn't really help at all. Does anyone know how I can stop wget from even downloading matching pages?

Well, you don't mention it, but I'll assume that those patterns occur in the query-string portion of the URL: that is, they follow a question mark (?) that appears at some point. Unfortunately, the -R and -A options only apply to the filename portion of the URL: that is, whatever falls between the last slash (/) and the first question mark. Confusingly, the check is also applied _after_ files are downloaded, to determine whether they should be deleted after the fact: so Wget probably downloads those files you really wish it wouldn't, and then deletes them afterwards anyway. Worse, there's no way around this, currently.

This is part of a suite of problems that are currently slated to be addressed soon. The most pertinent to your problem, though, is the need for a way to match against query strings. I'm very much hoping to get around to this before the next major Wget release, version 1.12. It's being tracked here: https://savannah.gnu.org/bugs/index.php?22089 If you add yourself to the Cc list, you'll be able to follow along on its progress.

- -- Cheers! Micah J.
Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
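To make the filename-portion rule above concrete (the wiki URL is hypothetical), here is the only part of such a URL that -R and -A patterns are matched against:

```shell
# A MediaWiki-style URL (hypothetical example):
url='http://wiki.example.org/w/index.php?title=Main_Page&action=history'
file=${url##*/}     # strip everything through the last slash
file=${file%%\?*}   # strip the query string
echo "$file"        # prints: index.php -- so '-R *action=*' can never match
```
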
RE: Wget and Yahoo login?
Micah Cowan wrote: The easiest way to do what you want may be to log in using your browser, and then tell Wget to use the cookies from your browser, using

Given the frequency of the "log in and then download a file" use case, it should probably be documented on the wiki. (Perhaps it already is. :-) Also, it would probably be helpful to have a shell script to automate this.

Tony
Re: Wget and Yahoo login?
Tony Lewis wrote:
Micah Cowan wrote: The easiest way to do what you want may be to log in using your browser, and then tell Wget to use the cookies from your browser, using

Given the frequency of the "log in and then download a file" use case, it should probably be documented on the wiki. (Perhaps it already is. :-)

Yeah, at http://wget.addictivecode.org/FrequentlyAskedQuestions#password-protected I think you missed the final sentence of my how-to: "(I'm going to put this up on the Wgiki Faq now, at http://wget.addictivecode.org/FrequentlyAskedQuestions)" :) (Back to you:)

Also, it would probably be helpful to have a shell script to automate this.

I filed the following issue some time ago: https://savannah.gnu.org/bugs/index.php?22561 The report is low on details, but I was envisioning something that would spew out forms and their fields, accept values for fields in one form, and invoke the appropriate Wget command to do the submission. I don't know if it could be _completely_ automated, since it's not 100% possible for the script to know which form fields are the ones it should be filling out. OTOH, there are some damn good heuristics that could be done: I imagine that the right form (in the event of more than one) can usually be guessed by seeing which one has a password-type input (assuming there's also only one of those). If that form has only one text-type input, then we've found the username field as well. Name-based heuristics (with "pass", "user", "uname", "login", etc.) could also help.

If someone wants to do this, that'd be terrific. It could probably reuse the existing HTML parser code from Wget. Otherwise, it'd probably be a while before I could get to it, since I've got higher priorities that have been languishing. Such a tool might also be an appropriate place to add FF3 SQLite cookie support.

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
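A very rough sketch of the password-input heuristic described above, in shell rather than a real HTML parser (the sample page and its field names are made up): list candidate username/password fields in a saved login page.

```shell
# Made-up sample login page; a real tool would fetch it and parse properly:
cat > login.html <<'EOF'
<form action="/doLogin.php" method="POST">
  <input type="text" name="s-login">
  <input type="password" name="s-pass">
</form>
EOF
# Crude heuristic: inputs of type text/password are the likely
# username/password fields (prints both <input> tags from the sample).
grep -oiE '<input[^>]*>' login.html | grep -iE 'type="(text|password)"'
```

A real implementation would also pick among multiple forms by looking for the one containing exactly one password-type input, as suggested above.
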
Re: Wget and Yahoo login?
At 04:27 PM 8/10/2008, you wrote: [snip]

Micah, thank you for taking the time to answer so thoroughly, and doing so promptly, too. You've given me a great boost forward, and I appreciate it. Thank you, sir!

Rick
Re: WGET :: [Correction de texte]
Hi Tom,

Thanks for this information. But could you tell us which version of Wget you are using? You can see that using: wget --version

I advise you to try the latest version, available here: http://wget.addictivecode.org/FrequentlyAskedQuestions#download

Also, the language of this mailing list is English.

Thanks, Julien.

2008/8/11 Tom [EMAIL PROTECTED]:

Hello! I'd like to let you know about a key that stayed pressed a quarter of a second too long, it seems! In the help for Wget (wget --help), we find:

Téléchargement récursif:
  -r, --recursive       spécifer un téléchargement récursif.
  -l, --level=NOMBRE    profondeeur maximale de récursion (inf ou 0 pour infini).

Just one 'e' to remove from 'profondeeur', and it will be fixed! Since it said to report any anomalies or suggestions to [EMAIL PROTECTED], I took the liberty of pointing this out! Thanks for this tool, and keep up the good work!

Regards, Tom
Re: WGET :: [Correction de texte]
* Tom ([EMAIL PROTECTED]) wrote:

Hello! I'd like to let you know about a key that stayed pressed a quarter of a second too long, it seems! ...

Téléchargement récursif:
  -r, --recursive       spécifer un téléchargement récursif.
  -l, --level=NOMBRE    *profondeeur* maximale de récursion (inf ou 0

Just one 'e' to remove from 'profondeeur', and it will be fixed!

Indeed, thanks!

Micah, instead of "profondeeur" it should be "profondeur". Where do you forward that info, the French GNU translation team? (./po/fr.po, around line 1472)

Saint Xavier.
Re: WGET :: [Correction de texte]
Saint Xavier wrote:

Micah, instead of "profondeeur" it should be "profondeur". Where do you forward that info, the French GNU translation team? (./po/fr.po, around line 1472)

Yup. The mailing address for the French translation team is [EMAIL PROTECTED] The team page is http://translationproject.org/team/fr.html; other translation teams are listed at http://translationproject.org/team/index.html

Looks like it's still present in the latest fr.po file at http://translationproject.org/latest/wget/fr.po

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
Re: Wget and Yahoo login?
Rick Nakroshis wrote:

Micah, if you will excuse a quick question about Wget, I'm trying to find out if I can use it to download a page from Yahoo that requires me to be logged in using my Yahoo profile name and password. It's a display of a CSV file, and the only wrinkle is trying to get past the Yahoo login. Try as I may, I just can't seem to find anything about Wget and Yahoo. Any suggestions or pointers?

Hi Rick,

In the future, it's better if you post questions to the mailing list at wget@sunsite.dk; I don't always have time to respond.

The easiest way to do what you want may be to log in using your browser, and then tell Wget to use the cookies from your browser, using --load-cookies=<path-to-browser's-cookies>. Of course, this only works if your browser saves its cookies in the standard text format (Firefox prior to version 3 will do this), or can export to that format. (Note that someone contributed a patch to allow Wget to work with Firefox 3 cookies; it's linked from http://wget.addictivecode.org/, but it's unofficial, so I can't vouch for its quality.)

Otherwise, you can perform the login using Wget, saving the cookies to a file of your choice, using --post-data=..., --save-cookies=cookies.txt, and probably --keep-session-cookies. This will require that you know what data to place in --post-data, which generally requires that you dig around in the HTML to find the right form field names, and where to post them.
For instance, if you find a form like the following within the page containing the log-in form:

  <form action="/doLogin.php" method="POST">
    <input type="text" name="s-login">
    <input type="password" name="s-pass">
  </form>

then you need to do something like:

  $ wget --post-data='s-login=USERNAME&s-pass=PASSWORD' \
      --save-cookies=my-cookies.txt --keep-session-cookies \
      http://HOSTNAME/doLogin.php

(Note that you _don't_ necessarily send the information to the page that had the login form: you send it to the spot mentioned in the action attribute of the password form.) Once this is done, you _should_ be able to perform further operations with Wget as if you're logged in, by using

  $ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
      --keep-session-cookies ...

(I'm going to put this up on the Wgiki Faq now, at http://wget.addictivecode.org/FrequentlyAskedQuestions)

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
Re: WGET Date-Time
Andreas Weller wrote:

Hi! I use wget to download files from an FTP server in a bash script. For example:

  touch last.time
  wget -nc ftp://[]/*.txt
  find . -newer last.time

This fails if the files on the FTP server are older than my last.time. So I want wget to set the file date/time to the local creation time, not the server's. How do I do this?

You can't, currently. This behavior is intended to support Wget's timestamping (-N) functionality. However, I'd accept a patch for an option that disables this.

- -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer. GNU Maintainer: wget, screen, teseq http://micah.cowan.name/
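Until such an option exists, one workaround sketch (assumes GNU find; untested against a real FTP server): compare the files' ctime, which is set locally when wget creates the file, instead of the mtime that wget copies from the server. The simulation below stands in for a download.

```shell
touch last.time
sleep 1
# Simulate a file "downloaded" after last.time but carrying an old server
# mtime, as wget's FTP timestamping produces:
touch -d '2001-01-01 00:00' downloaded.txt
find . -name 'downloaded.txt' -newer  last.time   # no match: mtime is old
find . -name 'downloaded.txt' -cnewer last.time   # match: ctime is "now"
```
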
Re: Wget scriptability
Micah Cowan wrote:

Okay, so there's been a lot of thought in the past, regarding better extensibility features for Wget. Things like hooks for adding support for traversal of new Content-Types besides text/html, or adding some form of JavaScript support, or support for MetaLink. Also, support for being able to filter results pre- and post-processing by Wget: for example, being able to do some filtering on the HTML to change how Wget sees it before parsing for links, but without affecting the actual downloaded version; or filtering the links themselves to alter what Wget fetches.

However, another thing that's been vaguely itching at me lately, is the fact that Wget's design is not particularly unix-y. Instead of doing one thing, and doing it well, it does a lot of things, some well, some not.

It does what various people needed. It wasn't an exercise in writing a unixy utility. It was a program that solved real problems for real people.

But the thing everyone loves about Unix and GNU (and certainly the thing that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline paradigm,

I have always hated that. With a passion.

- The tools themselves, as much as possible, should be written in an easily-hackable scripting language. Python makes a good candidate. Where we want efficiency, we can implement modules in C to do the work.

At the time Wget was conceived, that was Tcl's mantra. It failed miserably. :-)

How about concentrating on the problems listed in your first paragraph (which is why I quoted it)? Could you show us how a bunch of shell tools would solve them? Or how a librarized Wget would solve them? Or how any other paradigm or architecture or whatever would solve them?

--
Yes, I am an agent of Satan, but my duties are largely ceremonial.
[EMAIL PROTECTED]
Re: Wget scriptability
Dražen Kačar wrote:
>> But the thing everyone loves about Unix and GNU (and certainly the thing that drew me to them) is the bunch-of-tools-on-a-crazy-pipeline paradigm,
>
> I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no small part to its ability to integrate well with other command utilities (that is, to pipeline). The power and flexibility of pipelines is extremely well established in the Unix world; I feel no need whatsoever to waste breath arguing for it, particularly when you haven't provided the reasons you hate it. For my part, I'm not exaggerating when I say it's single-handedly responsible for why I'm a Unix/GNU user at all, and why I continue to highly enjoy developing on it.

    find . -name '*.html' -exec sed -i \
        's#http://oldhost/#http://newhost/#g' {} \;

    ( cat message; echo; echo '-- '; cat ~/.signature ) | \
        gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

    pic | tbl | eqn | eff-ing | troff -ms

Each one of these demonstrates the enormously powerful technique of using distinct tools, with distinct feature domains, together to form a cohesive solution for the need. The best part is (with the possible exception of the troff pipeline) that each of these components is immediately available for use in some other pipeline that does some other, completely different function.

Note, though, that I don't intend that using Piped-Wget would actually mean the user types in a special pipeline each time he wants to do something with it. The primary driver would read in some config file that would tell wget how it should do the piping. You just tweak the config file when you want to add new functionality.

>> - The tools themselves, as much as possible, should be written in an easily-hackable scripting language. Python makes a good candidate. Where we want efficiency, we can implement modules in C to do the work.
>
> At the time Wget was conceived, that was Tcl's mantra. It failed miserably. :-)

Are you claiming that Tcl's failure was due to the ability to integrate it with C, rather than its abysmal inadequacy as a programming language (changing it from an ability to integrate with C, to an absolute requirement to do so in order to get anything accomplished)?

> How about concentrating on the problems listed in your first paragraph (which is why I quoted it)? Could you show us how a bunch of shell tools would solve them? Or how a librarized Wget would solve them? Or how any other paradigm or architecture or whatever would solve them?

It should be trivially obvious: you plug them in, rather than wait for the Wget developers to get around to implementing it. The thing that both library-ized Wget and pipeline-ized Wget would offer is the same: extreme flexibility. It puts the users in control of what Wget does, rather than just perpetually hearing, "sorry, Wget can't do it: you could hack the source, though." :p

The difference between the two is that a pipelined Wget offers this flexibility to a wider range of users, whereas a library Wget offers it to C programmers.

Or how would you expect to do these things without a library-ized (at least) Wget? Implementing them in the core app (at least by default) is clearly wrong (scope bloat). Giving Wget a plugin architecture is good, but then there's only as much flexibility as there are hooks. Library-izing Wget is equivalent to providing everything as hooks, and puts the program using it in the driver's seat (and, naturally, there'd be a wrapper implementation, like curl for libcurl). A suite of interconnected utilities does the same, but is more accessible to greater numbers of people. Generally at some expense to efficiency (aren't all flexible architectures?); but Wget isn't CPU-bound, it's network-bound.

As mentioned in my original post, this would be a separate project from Wget. Wget would not be going away (though it seems likely to me that it would quickly reach a primarily
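The config-driven piping described above could look something like this toy driver. `pipeline.conf` and the filter stages are purely illustrative (a real Piped-Wget would plug download and parse stages into the chain); the point is only that the user edits a config file rather than typing a pipeline:

```shell
# Each line of pipeline.conf is one filter stage; the driver chains
# them together with pipes and feeds its stdin through the result.
cat > pipeline.conf <<'EOF'
tr '[:upper:]' '[:lower:]'
sed 's/world/there/'
EOF

run_pipeline() {
  cmd=cat
  while IFS= read -r stage; do
    cmd="$cmd | $stage"
  done < pipeline.conf
  eval "$cmd"
}

echo 'Hello World' | run_pipeline   # prints "hello there"
```

Adding a new processing step is then just adding a line to the config file, which is the "tweak the config file when you want to add new functionality" idea in miniature.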
Re: wget does not like this URL
Kevin O'Gorman wrote:
> Is there a reason I get this?
>
>     $ wget -O foo 'http://www.littlegolem.net/jsp/info/player_game_list_txt.jsp?plid=1107&gtid=hex'
>     Cannot specify -r, -p or -N if -O is given.
>     Usage: wget [OPTION]... [URL]...
>
> While I do have -O, I don't have the ones it seems to think I've specified. Without the -O foo it works fine, but of course puts the results in a different place. I get the same error message if I use the long-form option.

You most likely have timestamping=on in your wgetrc. -N and -O together were disallowed for version 1.11, but were re-enabled for 1.11.3 (I think) with a warning. The latest version of wget is 1.11.4.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
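If a wgetrc setting is indeed the cause, the fix is a one-line config change; a hedged sketch (assuming your wget supports `-e`, which passes a wgetrc-style command on the command line):

```shell
# In ~/.wgetrc, the usual culprit is a line like:
#   timestamping = on
# Either remove that line, or override it for a single invocation:
wget -e timestamping=off -O foo \
    'http://www.littlegolem.net/jsp/info/player_game_list_txt.jsp?plid=1107&gtid=hex'
```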
RE: wget-1.11.4 bug
Micah Cowan wrote:
> The thing is, though, those two threads should be running wgets under separate processes

Yes, the two threads are running wgets under separate processes, via system().

> What operating system are you running? Vista?

mipsel-linux with kernel v2.4, built with gcc v3.3.5.

Best regards,
K.C. Chao
Re: wget-1.11.4 bug
kuang-cheng chao wrote:
> Dear Micah: Thanks for your work on wget. There is a question about two wgets run simultaneously. In the method resolve_bind_address, wget assumes that it is called once. However, this can cause two domain names to get the same IP if two wgets run the same method concurrently.

Have you reproduced this, or is this in theory? If the latter, what has led you to this conclusion? I don't see anything in the code that would cause this behavior.

Also, please use the mailing list for discussions about Wget. I've added it to the recipients list.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
RE: wget-1.11.4 bug
Micah Cowan wrote:
> Have you reproduced this, or is this in theory? If the latter, what has led you to this conclusion? I don't see anything in the code that would cause this behavior.

I reproduced it, but I can't be sure the problem really is in resolve_bind_address. In the attached log, both api.yougotphoto.com and farm1.static.flickr.com get the same IP (74.124.203.218). The two wgets are called from two threads of a program.

Best regards,
k.c. chao

P.S. The log follows:

    wget -4 -t 6 'http://api.yougotphoto.com/device/?action=get_device_new_photo&api=2.2&api_key=f10df554a958fd10050e2d305241c7a3&device_class=2&serial_no=000E2EE5676F&url_no=24616&cksn=44fe191d6cb4e7807f75938b5d72f07c' -O /tmp/webii/ygp_new_photo_list.txt
    --1999-11-30 00:04:21--  http://api.yougotphoto.com/device/?action=get_device_new_photo&api=2.2&api_key=f10df554a958fd10050e2d305241c7a3&device_class=2&serial_no=000E2EE5676F&url_no=24616&cksn=44fe191d6cb4e7807f75938b5d72f07c
    Resolving api.yougotphoto.com...
    wget -4 -t 6 'http://farm1.static.flickr.com/33/49038824_e4b04b7d9f_b.jpg' -O /tmp/webii/24616
    74.124.203.218
    Connecting to api.yougotphoto.com|74.124.203.218|:80...
    --1999-11-30 00:04:22--  http://farm1.static.flickr.com/33/49038824_e4b04b7d9f_b.jpg
    Resolving farm1.static.flickr.com... 74.124.203.218
    Connecting to farm1.static.flickr.com|74.124.203.218|:80... connected.
Re: wget-1.11.4 bug
k.c. chao wrote:
> I reproduced it, but I can't be sure the problem really is in resolve_bind_address. In the attached log, both api.yougotphoto.com and farm1.static.flickr.com get the same IP (74.124.203.218). The two wgets are called from two threads of a program.

Yeah, I get 68.142.213.135 for the flickr.com address, currently. The thing is, though, those two threads should be running wgets under separate processes (I'm not sure how they couldn't be, but if they somehow weren't, that would be using Wget other than how it was designed to be used). This problem sounds much more like an issue with the OS's API than an issue with Wget, to me. But we'd still want to work around it if it were feasible.

What operating system are you running? Vista?

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
Re: Wget
Hor Meng Yoong wrote:
> Hi: I understand that you are a very busy person. Sorry to disturb you.

Hi; please use the mailing list for support requests. I've copied the list in my response.

> I am using wget to mirror (using ftp://) a user home directory from a Unix machine. Wget defaults to the user's home directory. However, I also need to get the /etc folder. So, I tried to use ../../../etc. It works, but the resulting ftp'd files land in %2E%2E/%2E%2E/%2E%2E. Any means to overcome this, or rename the directory?

Try the -nd option (you may also need -nH). You might prefer to fetch /etc in a separate invocation from the other things; perhaps with the -P option to specify a directory name.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
Re: WGET bug...
HARPREET SAWHNEY wrote:
> Hi, I am getting a strange bug when I use wget to download a binary file from a URL, versus when I download it manually. The attached ZIP file contains two files:
>
>     05.upc  --- manually downloaded
>     dum.upc --- downloaded through wget
>
> wget adds a number of ASCII characters to the head of the file and seems to delete a similar number from the tail. So the file sizes are the same, but the addition and deletion render the file useless. Could you please direct me on whether I should be using some specific option to avoid this problem?

In the future, it's useful to mention which version of Wget you're using.

The problem you're having is that the server is adding the extra HTML at the front of your session, and then giving you the file contents anyway. It's a bug in the PHP code that serves the file. You're getting this extra content because you are not logged in when you're fetching it. You need to have Wget send a cookie with the login-session information, and then the server will probably stop sending the corrupting information at the head of the file.

The site does not appear to use HTTP's authentication mechanisms, so the [EMAIL PROTECTED] bit in the URL doesn't do you any good. It uses forms-and-cookies authentication. Hopefully, you're using a browser that stores its cookies in a text format, or that is capable of exporting to a text format. In that case, you can just ensure that you're logged in in your browser, and use the --load-cookies=cookies.txt option to Wget to use the same session information. Otherwise, you'll need to use --save-cookies with Wget to simulate the login form post, which is tricky and requires some understanding of HTML forms.

-- 
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
Re: WGET bug...
HARPREET SAWHNEY wrote:
> Hi, Thanks for the prompt response. I am using GNU Wget 1.10.2. I tried a few things on your suggestion, but the problem remains.
>
> 1. I exported the cookies file in Internet Explorer and specified that on the Wget command line, but the same error occurs.
> 2. I have an open session on the site with my username and password.
> 3. I also tried running wget while I am downloading a file from the IE session on the site, but got the same error.

Sounds like you'll need to get the appropriate cookie by using Wget to log in to the website. This requires site-specific information from the user-login form page, though, so I can't help you without that. If you know how to read some HTML, then you can find the form used for posting the username/password, and use:

    wget --keep-session-cookies --save-cookies=cookies.txt \
        --post-data='USERNAME=foo&PASSWORD=bar' ACTION

where ACTION is the value of the form's action field, USERNAME and PASSWORD (and possibly further required values) are field names from the HTML form, and foo and bar are the username and password.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
RE: Wget 1.11.3 - case sensitivity and URLs
Coombe, Allan David (DPS) wrote:
> However, the case of the files on disk is still mixed - so I assume that wget is not using the URL it originally requested (harvested from the HTML?) to create directories and files on disk. So what is it using? An HTTP header (if so, which one?)

I think wget uses the case from the HTML page(s) for the file name; your proxy would need to change the URLs in the HTML pages to lower case too.

Tony
Re: Wget 1.11.3 - case sensitivity and URLs
Tony Lewis wrote:
> Coombe, Allan David (DPS) wrote:
>> However, the case of the files on disk is still mixed - so I assume that wget is not using the URL it originally requested (harvested from the HTML?) to create directories and files on disk. So what is it using? An HTTP header (if so, which one?)
>
> I think wget uses the case from the HTML page(s) for the file name; your proxy would need to change the URLs in the HTML pages to lower case too.

My understanding from David's post is that he claimed to have been doing just that:

> I modified the response from the web site to lowercase the urls in the html (actually I lowercased the whole response) and the data that wget put on disk was fully lowercased - problem solved - or so I thought.

My suspicion is it's not quite working, though, as otherwise where would Wget be getting the mixed-case URLs?

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
RE: Wget 1.11.3 - case sensitivity and URLs
Sorry guys - just an ID 10 T error on my part. I think I need to change two things in the proxy server:

1. URLs in the HTML being returned to wget - this works OK.
2. The Content-Location header used when the web server reports a 301 Moved Permanently response - I think this works OK.

When I reported that it wasn't working, I hadn't done both at the same time.

Cheers
Allan
Re: Wget 1.11.3 - case sensitivity and URLs
Coombe, Allan David (DPS) wrote:
> OK - now I am confused. I found a perl-based http proxy (named HTTP::Proxy, funnily enough) that has filters to change both the request and response headers and data. I modified the response from the web site to lowercase the URLs in the HTML (actually I lowercased the whole response), and the data that wget put on disk was fully lowercased - problem solved - or so I thought. However, the case of the files on disk is still mixed - so I assume that wget is not using the URL it originally requested (harvested from the HTML?) to create directories and files on disk. So what is it using? An HTTP header (if so, which one?)

I think you're missing something on your end; I couldn't begin to tell you what. Running with --debug will likely be informative.

Wget uses the URL that successfully results in a file download. If the files on disk have mixed case, then it's because they were the result of a mixed-case request from Wget (which, in turn, must have resulted either from an explicit argument, or from HTML content).

The only exception to the above is when you explicitly enable --content-disposition support, in which case Wget will use any filename specified in a Content-Disposition header. Those are virtually never issued, except for CGI-based downloads (and you have to explicitly enable it).

-- 
Good luck!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
RE: Wget 1.11.3 - case sensitivity and URLs
OK - now I am confused. I found a perl-based http proxy (named HTTP::Proxy, funnily enough) that has filters to change both the request and response headers and data. I modified the response from the web site to lowercase the URLs in the HTML (actually I lowercased the whole response), and the data that wget put on disk was fully lowercased - problem solved - or so I thought.

However, the case of the files on disk is still mixed - so I assume that wget is not using the URL it originally requested (harvested from the HTML?) to create directories and files on disk. So what is it using? An HTTP header (if so, which one?)

Any ideas?

Cheers
Allan
Re: wget doesn't load page-requisites from a) dynamic web page b) through https
Hello Stefan,

I have a question. On 2008-06-18 12:17:12, Stefan Nowak wrote:

>     wget \
>         --page-requisites \
>         --html-extension \
>         --convert-links \
>         --span-hosts \
>         --no-check-certificate \
>         --debug \
>         https://help.ubuntu.com/community/MacBookPro/ > log.txt

Why do you use "> log.txt" instead of --output-file=log.txt or --append-output=log.txt?

Thanks, greetings and nice day/evening,

Michelle Konzack
Systemadministrator, 24V Electronic Engineer
Tamay Dogan Network
Debian GNU/Linux Consultant
Linux-User #280138 with the Linux Counter, http://counter.li.org/
50, rue de Soultz, 67100 Strasbourg/France
RE: Wget 1.11.3 - case sensitivity and URLs
Thanks everyone for the contributions.

Ultimately, our purpose is to process documents from the site into our search database, so probably the most important thing is to limit the number of files being processed. The case of the URLs in the HTML probably wouldn't cause us much concern, but I could see that it might be useful to convert a site for mirroring from a non-case-sensitive (Windows) environment to a case-sensitive (li|u)nix one - this would need to include translation of URLs in content as well as filenames on disk.

In the meantime - does anyone know of a proxy server that could translate URLs from mixed case to lower case? I thought that if we downloaded using wget via such a proxy server we might get the appropriate result.

The other alternative we were thinking of was to post-process the files with symlinks for all mixed-case versions of files and directories (I think someone already suggested this - great minds and all that...). I assume that wget would correctly use the symlink to determine the time/date stamp of the file for determining if it requires updating (or would it use the time/date stamp of the symlink?). I also assume that if wget downloaded the file it would overwrite the symlink and we would have to run our convert-files-to-symlinks process again.

Just to put it in perspective, the actual site is approximately 45 GB (that's what the administrator said) and wget downloaded 100 GB (463,000 files) when I did the first process.

Cheers
Allan

-----Original Message-----
From: Micah Cowan [mailto:[EMAIL PROTECTED]]
Sent: Saturday, 14 June 2008 7:30 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensitivity and URLs

Tony Lewis wrote:
> Micah Cowan wrote:
>> Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively.
>
> To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think you probably want to send the original case to the server (just in case it really does matter to the server). If you're going to treat different-case URIs as matching, then the lower-case version will have to be stored in the hash. The most important part (from the perspective that Allan voices) is that the versions written to disk use lower-case characters.

Well, that really depends. If it's doing a straight recursive download, without preexisting local files, then all that's really necessary is to do lookups/stores in the blacklist in a case-normalized manner. If preexisting files matter, then yes, your solution would fix it. Another solution would be to scan directory contents for the first name that matches case-insensitively. That's obviously much less efficient, but has the advantage that the file will match at least one of the real cases from the server.

As Matthias points out, your lower-case normalization solution could be achieved in a more general manner with a hook. Which is something I was planning on introducing perhaps in 1.13 anyway (so you could, say, run sed on the filenames before Wget uses them), so that's probably the approach I'd take. But probably not before 1.13, even if someone provides a patch for it in time for 1.12 (too many other things to focus on, and I'd like to introduce the external command hooks as a suite, if possible).

OTOH, case normalization in the blacklists would still be useful, in addition to that mechanism. Could make another good addition for 1.13 (because it'll be more useful in combination with the rename hooks).

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
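For the post-processing route, a minimal sketch of a lowercase-rename pass over a mirror tree ("mirror" and the sample files are illustrative; note it doesn't handle collisions where both `Foo` and `foo` already exist, and a symlink variant would create links instead of renaming):

```shell
# Build a tiny sample mirror to demonstrate on:
mkdir -p mirror/Docs
echo hello > mirror/Docs/Index.HTML

# -depth processes children before their parent directories, so files
# are renamed before the directories that contain them.
find mirror -depth -name '*[A-Z]*' | while IFS= read -r path; do
  dir=$(dirname -- "$path")
  base=$(basename -- "$path")
  lc=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
  [ "$base" = "$lc" ] || mv -- "$path" "$dir/$lc"
done
# mirror/Docs/Index.HTML has become mirror/docs/index.html
```

Run after each wget pass, this would also re-lowercase any file that wget re-downloaded under its original mixed-case name.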
Re: Wget 1.11.3 - case sensitivity and URLs
A simple URL-rewriting conf should fix the problem, without touching the file system; everything can be done server-side.

Best Regards

On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS) [EMAIL PROTECTED] wrote:
> Thanks everyone for the contributions. Ultimately, our purpose is to process documents from the site into our search database, so probably the most important thing is to limit the number of files being processed.

-- 
-mmw
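The case-normalized blacklist lookups Micah mentions amount to lowercasing the key on both store and lookup. A toy sketch of the idea (the function names and the flat-file "blacklist" are made up for illustration; Wget's real blacklist is an in-memory hash table):

```shell
blacklist=$(mktemp)

# Store and look up URLs under a lowercased key, so differently-cased
# spellings of the same URL hit the same entry.
add_url()  { printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' >> "$blacklist"; }
seen_url() { printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | grep -qxFf - "$blacklist"; }

add_url 'http://Example.com/Docs/Page.HTML'
seen_url 'http://example.com/docs/page.html' && echo "already fetched"
```

With lookups done this way, a recursive download would fetch each URL only once regardless of how its case varies across pages, which is the part of Allan's problem that doesn't depend on the on-disk names.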
Re: Wget 1.11.3 - case sensitivity and URLs
without touching the file system On Thu, Jun 19, 2008 at 9:23 AM, mm w [EMAIL PROTECTED] wrote: a simple url-rewriting conf should fix the problem, wihout touch the file system everything can be done server side Best Regards On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS) [EMAIL PROTECTED] wrote: Thanks averyone for the contributions. Ultimately, our purpose is to process documents from the site into our search database, so probably the most important thing is to limit the number of files being processed. The case of the URLs in the html probably wouldn't cause us much concern, but I could see that it might be useful to convert a site for mirroring from a non-case sensetive (windows) environment to a case sensetive (li|u)nix one - this would need to include translation of urls in content as well as filenames on disk. In the meantime - does anyone know of a proxy server that could translate urls from mixed case to lower case. I thought that if we downloaded using wget via such a proxy server we might get the appropriate result. The other alternative we were thinking of was to post process the files with symlinks for all mixed case versions of files and directories (I think someone already suggested this - greate minds and all that...). I assume that wget would correctly use the symlink to determine the time/date stamp of the file for determining if it requires updating (or would it use the time/date stamp of the symlink?). I also assume that if wget downloaded the file it would overwrite the symlink and we would have to run our convert files to symlinks process again. Just to put it in perspective, the actual site is approximately 45gb (that's what the administrator said) and wget downloaded 100gb (463,000 files) when I did the first process. 
Cheers Allan

-Original Message- From: Micah Cowan [mailto:[EMAIL PROTECTED] Sent: Saturday, 14 June 2008 7:30 AM To: Tony Lewis Cc: Coombe, Allan David (DPS); 'Wget' Subject: Re: Wget 1.11.3 - case sensitivity and URLs

Tony Lewis wrote: Micah Cowan wrote: Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively. To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think you probably want to send the original case to the server (just in case it really does matter to the server). If you're going to treat different-case URIs as matching, then the lower-case version will have to be stored in the hash. The most important part (from the perspective that Allan voices) is that the versions written to disk use lower-case characters.

Well, that really depends. If it's doing a straight recursive download, without preexisting local files, then all that's really necessary is to do lookups/stores in the blacklist in a case-normalized manner. If preexisting files matter, then yes, your solution would fix it. Another solution would be to scan directory contents for the first name that matches case-insensitively. That's obviously much less efficient, but has the advantage that the file will match at least one of the real cases from the server. As Matthias points out, your lower-case normalization solution could be achieved in a more general manner with a hook. Which is something I was planning on introducing perhaps in 1.13 anyway (so you could, say, run sed on the filenames before Wget uses them), so that's probably the approach I'd take.
But probably not before 1.13, even if someone provides a patch for it in time for 1.12 (too many other things to focus on, and I'd like to introduce the external command hooks as a suite, if possible). OTOH, case normalization in the blacklists would still be useful, in addition to that mechanism. Could make another good addition for 1.13 (because it'll be more useful in combination with the rename hooks).

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/

-- -mmw
RE: Wget 1.11.3 - case sensitivity and URLs
mm w wrote: a simple url-rewriting conf should fix the problem, without touching the file system; everything can be done server side

Why do you assume the user of wget has any control over the server from which content is being downloaded?
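For readers who do control the origin server, the rewrite-rule idea can be made concrete. A hypothetical sketch for Apache's mod_rewrite (assuming Apache, in server or virtual-host config context; int:tolower is mod_rewrite's built-in lower-casing internal map):

```apache
# Hypothetical sketch: redirect any request whose path contains
# upper-case letters to the all-lower-case equivalent, so clients
# (and wget) only ever see one casing.
RewriteEngine On
RewriteMap lc int:tolower
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^/?(.*)$ /${lc:$1} [R=301,L]
```

With this in place, a request for /Senate/committees/index.htm would be redirected to /senate/committees/index.htm before any content is served.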
Re: Wget 1.11.3 - case sensitivity and URLs
not all, but in this particular case I'm pretty sure they have On Thu, Jun 19, 2008 at 10:42 AM, Tony Lewis [EMAIL PROTECTED] wrote: mm w wrote: a simple url-rewriting conf should fix the problem, without touching the file system; everything can be done server side Why do you assume the user of wget has any control over the server from which content is being downloaded? -- -mmw
Re: wget doesn't load page-requisites from a) dynamic web page b) through https
Dear Stefan, If you take a look at the source of the page, you'll see this:

<meta name="robots" content="index,nofollow">

Simply add -e robots=off to your arguments and wget will ignore any robots.txt files or tags. With that it should download everything you want. (I did not find this myself; credits go to sxav for pointing this out. ;) Cheers, Valentin -- The last time someone listened to a Bush, a bunch of people wandered in the desert for 40 years.
Re: wget doesn't load page-requisites from a) dynamic web page b) through https
On Jun 18, 2008, at 5:17 AM, Stefan Nowak wrote: where do I set the locale of the CLI environment of MacOSX? You should set the LANG environment variable to the desired locale, and one which is supported on your system; you can look at the directories in /usr/share/locale to see what locales are available. For example, if you want American English, set LANG to en_US. In the Bash shell, you can type export LANG=en_US In the Tcsh shell, you can type setenv LANG en_US To find out which shell you use, type echo $SHELL
Re: wget doesn't load page-requisites from a) dynamic web page b) through https
Ryan Schmidt wrote: For example, if you want American English, set LANG to en_US. In the Bash shell, you can type export LANG=en_US In the Tcsh shell, you can type setenv LANG en_US To find out which shell you use, type echo $SHELL

FYI: It's not in any current release, but current mainline has support for the special [EMAIL PROTECTED] for LANGUAGE (still may need to set LANG=en_US or something). This causes all quoted strings to be rendered in boldface, using terminal escape sequences. I've found it pleasant to use that setting for my own purposes. The [EMAIL PROTECTED] LANGUAGE setting is also supported (converts to proper left/right quote marks, but no terminal sequences); but I've rigged LANG=en_US to have the same effect ([EMAIL PROTECTED] is copied to en_US.po). Again, this is only in the mainline repo, and not in any release.

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/
Re: Wget 1.11.3 - case sensitivity and URLs
On Sat, Jun 14, 2008 at 4:30 PM, Tony Lewis [EMAIL PROTECTED] wrote: mm w wrote: Hi, after all it's only my point of view :D anyway, /dir/file, dir/File, non-standard Dir/file, non-standard and /Dir/File non-standard According to RFC 2396: The path component contains data, specific to the authority (or the scheme if there is no authority component), identifying the resource within the scope of that scheme and authority. In other words, those names are well within the standard when the server understands them. As far as I know, there is nothing in Internet standards restricting mixed-case paths. :) read again, nobody does except some punk-head folks that's it, if the server manages non-standard URLs, it's not my concern; for me it doesn't exist Oh. I see. You're writing to say that wget should only implement features that are meaningful to you. Thanks for your narcissistic input. no, I'm not such a jerk; a simple grep/sed on the website source to remove the malicious URLs should be fine, or an HTTP redirection when the malicious non-standard URL is called on the other hand, if wget changed every link to lower case, some people would have the opposite problem a golden rule: never distribute mixed-case URLs (to your users), a simple respect for them, and everything in lower case Tony -- -mmw
RE: Wget 1.11.3 - case sensitivity and URLs
mm w wrote: Hi, after all, after all it's only my point of view :D anyway, /dir/file, dir/File, non-standard Dir/file, non-standard and /Dir/File non-standard According to RFC 2396: The path component contains data, specific to the authority (or the scheme if there is no authority component), identifying the resource within the scope of that scheme and authority. In other words, those names are well within the standard when the server understands them. As far as I know, there is nothing in Internet standards restricting mixed case paths. that's it, if the server manages non-standard URL, it's not my concern, for me it doesn't exist Oh. I see. You're writing to say that wget should only implement features that are meaningful to you. Thanks for your narcissistic input. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
Micah Cowan wrote: Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively. To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think you probably want to send the original case to the server (just in case it really does matter to the server). If you're going to treat different-case URIs as matching, then the lower-case version will have to be stored in the hash. The most important part (from the perspective that Allan voices) is that the versions written to disk use lower-case characters. Tony
Re: Wget 1.11.3 - case sensitivity and URLs
standard: URLs are case-insensitive you can adapt your software because some people don't respect the standard, we are not in the '90s anymore, let people doing crappy things deal with their crappy world Cheers! On Fri, Jun 13, 2008 at 2:08 PM, Tony Lewis [EMAIL PROTECTED] wrote: Micah Cowan wrote: Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively. To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think you probably want to send the original case to the server (just in case it really does matter to the server). If you're going to treat different-case URIs as matching, then the lower-case version will have to be stored in the hash. The most important part (from the perspective that Allan voices) is that the versions written to disk use lower-case characters. Tony -- -mmw
Re: Wget 1.11.3 - case sensitivity and URLs
Tony Lewis wrote: Micah Cowan wrote: Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively. To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think you probably want to send the original case to the server (just in case it really does matter to the server). If you're going to treat different-case URIs as matching, then the lower-case version will have to be stored in the hash. The most important part (from the perspective that Allan voices) is that the versions written to disk use lower-case characters.

Well, that really depends. If it's doing a straight recursive download, without preexisting local files, then all that's really necessary is to do lookups/stores in the blacklist in a case-normalized manner. If preexisting files matter, then yes, your solution would fix it. Another solution would be to scan directory contents for the first name that matches case-insensitively. That's obviously much less efficient, but has the advantage that the file will match at least one of the real cases from the server. As Matthias points out, your lower-case normalization solution could be achieved in a more general manner with a hook. Which is something I was planning on introducing perhaps in 1.13 anyway (so you could, say, run sed on the filenames before Wget uses them), so that's probably the approach I'd take. But probably not before 1.13, even if someone provides a patch for it in time for 1.12 (too many other things to focus on, and I'd like to introduce the external command hooks as a suite, if possible). OTOH, case normalization in the blacklists would still be useful, in addition to that mechanism.
Could make another good addition for 1.13 (because it'll be more useful in combination with the rename hooks).

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/
Re: Wget 1.11.3 - case sensitivity and URLs
In the VMS world, where file name case may matter, but usually doesn't, the normal scheme is to preserve case when creating files, but to do case-insensitive comparisons on file names. From Tony Lewis: To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think that that's the wrong way to look at it. Implementation details like name hashing may also need to be adjusted, but this shouldn't be too hard. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street(+1) 651-699-9818 Saint Paul MN 55105-2547
RE: Wget 1.11.3 - case sensitivity and URLs
mm w wrote: standard: URLs are case-insensitive you can adapt your software because some people don't respect the standard, we are not in the '90s anymore, let people doing crappy things deal with their crappy world

You obviously missed the point of the original posting: how can one conveniently mirror a site whose server uses case-insensitive names onto a server that uses case-sensitive names? If the original site has the URI strings /dir/file, /dir/File, /Dir/file, and /Dir/File, the same local file will be returned. However, wget will treat those as unique directories and files, and you wind up with four copies. Allan asked if there is a way to have wget create just one copy and proposed one way that might accomplish that goal. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
Steven M. Schweda wrote: From Tony Lewis: To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think that that's the wrong way to look at it. Implementation details like name hashing may also need to be adjusted, but this shouldn't be too hard. OK. How would you normalize the names? Tony
Re: Wget 1.11.3 - case sensitivity and URLs
Hi, after all it's only my point of view :D anyway, /dir/file, dir/File, non-standard Dir/file, non-standard and /Dir/File non-standard that's it, if the server manages non-standard URLs, it's not my concern; for me it doesn't exist On Fri, Jun 13, 2008 at 3:12 PM, Tony Lewis [EMAIL PROTECTED] wrote: mm w wrote: standard: URLs are case-insensitive you can adapt your software because some people don't respect the standard, we are not in the '90s anymore, let people doing crappy things deal with their crappy world You obviously missed the point of the original posting: how can one conveniently mirror a site whose server uses case-insensitive names onto a server that uses case-sensitive names? If the original site has the URI strings /dir/file, /dir/File, /Dir/file, and /Dir/File, the same local file will be returned. However, wget will treat those as unique directories and files, and you wind up with four copies. Allan asked if there is a way to have wget create just one copy and proposed one way that might accomplish that goal. Tony -- -mmw
Re: Wget 1.11.3 - case sensitivity and URLs
Hi list! Sadly I couldn't find the email address of Allan (maybe because I'm attached via the news gateway), so this is a list-only post. Micah Cowan wrote: Hi Allan, You'll generally get better results if you post to the mailing list (wget@sunsite.dk). I've added it to the recipients list. Coombe, Allan David (DPS) wrote: Hi Micah, First some context: We are using wget 1.11.3 to mirror a web site so we can do some offline processing on it. The mirror is on a Solaris 10 x86 server. The problem we are getting appears to be because the URLs in the HTML pages that are harvested by wget for downloading have mixed case (the site we are mirroring is running on a Windows 2000 server using IIS) and the directory structure created on the mirror has 'duplicate' directories because of the mixed case. For example, the URLs in HTML pages /Senate/committees/index.htm and /senate/committees/index.htm refer to the same file, but wget creates 2 different directory structures on the mirror site for these URLs.

OK... at this point I need to ask whether you want to mirror or just back up the site. The main problem is easy: the moment you want a working mirror, you either need those mixed-case files or have to rewrite the URLs to a unique casing. At this point it seems most practical to introduce a hook like --restrict-file-names to modify the name of the local copy and the links inside the downloaded files in the same way. Another option is to create symlinks for the different directory cases. That would save half the overhead, I guess. To create such a symlink structure you could use the output of find /mirror/basedir -type d | sort -f hope that helps. Matthias
Re: Wget 1.11.3 - case sensitivity and URLs
Hi Allan, You'll generally get better results if you post to the mailing list (wget@sunsite.dk). I've added it to the recipients list. Coombe, Allan David (DPS) wrote: Hi Micah, First some context: We are using wget 1.11.3 to mirror a web site so we can do some offline processing on it. The mirror is on a Solaris 10 x86 server. The problem we are getting appears to be because the URLs in the HTML pages that are harvested by wget for downloading have mixed case (the site we are mirroring is running on a Windows 2000 server using IIS) and the directory structure created on the mirror has 'duplicate' directories because of the mixed case. For example, the URLs in HTML pages /Senate/committees/index.htm and /senate/committees/index.htm refer to the same file, but wget creates 2 different directory structures on the mirror site for these URLs. This appears to be a fairly basic thing, but we can't see any wget options that allow us to treat URLs case-insensitively. We don't really want to post-process the site just to merge the files and directories with different case.

Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively. Finding local files case-insensitively, on a case-sensitive filesystem, would be a PITA; but adding and looking up URLs in the internal blacklist hash wouldn't be too hard. I probably wouldn't get to that for a while, though. Another useful option might be to change the name of index files, so that, for instance, you could have URLs like http://foo/ result in foo/index.htm or foo/default.html, rather than foo/index.html.

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer.
http://micah.cowan.name/
Re: wget 1.11.1 make test fails
Alain Guibert [EMAIL PROTECTED] writes: On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote: Micah Cowan [EMAIL PROTECTED] writes: It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code apparently intending to return FNM_NOMATCH on a slash. But this code seems to be rather broken. Or it could be that you're picking up a different fnmatch.h that sets up a different value for FNM_PATHNAME. Do you have more than one fnmatch.h installed on your system?
Re: wget 1.11.1 make test fails
Alain Guibert [EMAIL PROTECTED] writes: Maybe you could put a breakpoint in fnmatch and see what goes wrong? The for loop intended to eat several characters from the string also advances the pattern pointer. This one reaches the end of the pattern, and points to a NUL. It is not a '*' anymore, so the loop exits prematurely. Just below, a test for NUL returns 0. Thanks for the analysis. Looking at the current fnmatch code in gnulib, it seems that the fix is to change that NUL test to something like:

    if (c == '\0')
      {
        /* The wildcard(s) is/are the last element of the pattern.
           If the name is a file name and contains another slash
           this means it cannot match.  */
        int result = (flags & FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
        if (flags & FNM_PATHNAME)
          {
            if (!strchr (n, '/'))
              result = 0;
          }
        return result;
      }

But I'm not at all sure that it covers all the needed cases. Maybe we should simply switch to gnulib-provided fnmatch? Unfortunately that one is quite complex, and quite hard for the '**' extension Micah envisions. There might be other fnmatch implementations out there in GNU which are debugged but still simpler than the gnulib/glibc one. It's kind of ironic that while the various system fnmatches were considered broken, the one Wget was using (for many years unconditionally!) was also broken.
Re: wget 1.11.1 make test fails
Hrvoje Niksic wrote: Alain Guibert [EMAIL PROTECTED] writes: Maybe you could put a breakpoint in fnmatch and see what goes wrong? The for loop intended to eat several characters from the string also advances the pattern pointer. This one reaches the end of the pattern, and points to a NUL. It is not a '*' anymore, so the loop exits prematurely. Just below, a test for NUL returns 0. Thanks for the analysis. Looking at the current fnmatch code in gnulib, it seems that the fix is to change that NUL test to something like:

    if (c == '\0')
      {
        /* The wildcard(s) is/are the last element of the pattern.
           If the name is a file name and contains another slash
           this means it cannot match.  */
        int result = (flags & FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
        if (flags & FNM_PATHNAME)
          {
            if (!strchr (n, '/'))
              result = 0;
          }
        return result;
      }

But I'm not at all sure that it covers all the needed cases.

I'm thinking not: the loop still shouldn't be incrementing n, since that forces each additional * to match at least one character, doesn't it? Gnulib's version seems to handle that better. Maybe we should simply switch to gnulib-provided fnmatch? Unfortunately that one is quite complex, and quite hard for the '**' extension Micah envisions. There might be other fnmatch implementations out there in GNU which are debugged but still simpler than the gnulib/glibc one. Maybe. I'm not sure ** would be too hard to add to gnulib's fnmatch; we'd just have to toggle the FNM_FILE_NAME tests within the '*' case, if we see an immediate second '*'. But maybe ** as part of a *?**? sequence is more complex. I don't think so, though. The main thing is that we need it to support the invalid-sequence stuff. Hm; I'm not sure we'll ever want fnmatch() to be locale-aware, though. User-specified match patterns should interpret characters based on the locale; but the source strings may be in different encodings altogether.
If we solve this by transcoding to the current locale, we may find that the user's locale doesn't support all of the characters that the original string's encoding does. Probably we'll need to transcode both to Unicode before comparison. In the meantime, though, I think we want a simple byte-by-byte match. Perhaps it's best to (a) use our custom matcher, ignoring the system's (so we don't get locale specialness), and (b) fix it, providing as thorough test coverage as possible.

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/
Re: wget 1.11.1 make test fails
On Thursday, April 3, 2008 at 22:37:41 +0200, Hrvoje Niksic wrote: Or it could be that you're picking up a different fnmatch.h that sets up a different value for FNM_PATHNAME. Do you have more than one fnmatch.h installed on your system? I have only /usr/include/fnmatch.h installed, identical to the file in the libc-5.4.33 tarball, and defining the same values as wget's src/sysdep.h (even comments are identical). Just my fnmatch.h defines two more flags, FNM_LEADING_DIR=8 and FNM_CASEFOLD=16, and defines an FNM_FILE_NAME alias (commented as Preferred GNU name) to FNM_PATHNAME=1 (the libc code uses only this alias). Anyway I had noticed your comment about incompatible headers, and double-checked your little test program also with explicit value 1: same results. BTW everybody should be able to reproduce the make test failure, on any system, just by #undefining SYSTEM_FNMATCH in src/sysdep.h Alain.
Re: wget 1.11.1 make test fails
On Thursday, April 3, 2008 at 9:14:52 -0700, Micah Cowan wrote: Are you certain you rebuilt cmpt.o? This seems pretty unlikely, to me. Certain: make test after touching src/sysdep.h rebuilds both cmpt.o, the normal in src/ and the one in tests/. And both those cmpt.o become 784 bytes bigger without SYSTEM_FNMATCH. Alain.
Re: wget 1.11.1 make test fails
Alain Guibert [EMAIL PROTECTED] writes: This old system does HAVE_WORKING_FNMATCH_H (and thus SYSTEM_FNMATCH). When #undefining SYSTEM_FNMATCH, the test still fails at the very same line. And then it also fails on modern systems. I guess this points at the embedded src/cmpt.c:fnmatch() replacement? Well, it would point to a problem with both the fnmatch replacement and the older system fnmatch. Our fnmatch (coming from an old release of Bash, but otherwise very well-tested, both in Bash and Wget) is careful to special-case '/' only if FNM_PATHNAME is specified. Maybe you could put a breakpoint in fnmatch and see what goes wrong?
Re: wget 1.11.1 make test fails
On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote: Micah Cowan [EMAIL PROTECTED] writes: It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code apparently intending to return FNM_NOMATCH on a slash. But this code seems to be rather broken. | printf("%d\n", fnmatch("foo*", "foo/bar", FNM_PATHNAME)); It should print a non-zero value. Zero on the old system, FNM_NOMATCH on a recent one. Alain.
Re: wget fails using proxy with https-protocol
Micah Cowan wrote: The log shows that: 1. Wget still doesn't wait for the proxy to ask for authentication before sending Proxy-Authorization headers with its first request. 2. Apparently, when going through a proxy, Wget now correctly waits to receive a challenge from the destination server (as I intended), but then _doesn't_ respond to the challenge with an Authorization header, instead just treating the (first) 401 as a final header.

Slava, could you perhaps download and install Wget 1.11.1, and try it with the --auth-no-challenge option? That was added to support a case when there was a genuine need for Wget's older, less secure authentication behavior; it's intended to disable the new behavior. It may or may not fix your problem, and I'd be interested to know which it is. :)

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/
Re: wget 1.11.1 make test fails
On Thursday, April 3, 2008 at 11:08:27 +0200, Hrvoje Niksic wrote: Well, it would point to a problem with both the fnmatch replacement and the older system fnmatch. Our fnmatch (coming from an old release of Bash The fnmatch()es in libc 5.4.33 and in Wget are twins. They differ on some minor details like FNM_CASEFOLD support, and cosmetic things like parentheses around return codes. The part dealing with * in the pattern is functionally identical. Maybe you could put a breakpoint in fnmatch and see what goes wrong? The for loop intended to eat several characters from the string also advances the pattern pointer. This one reaches the end of the pattern, and points to a NUL. It is not a '*' anymore, so the loop exits prematurely. Just below, a test for NUL returns 0. The body of the loop, returning FNM_NOMATCH on a slash, is not executed at all. That isn't moderately broken, is it? Alain.
fnmatch [Re: wget 1.11.1 make test fails]
Alain Guibert wrote: The for loop intended to eat several characters from the string also advances the pattern pointer. This one reaches the end of the pattern, and points to a NUL. It is not a '*' anymore, so the loop exits prematurely. Just below, a test for NUL returns 0. The body of the loop, returning FNM_NOMATCH on a slash, is not executed at all. That isn't moderately broken, is it?

I haven't stepped through it, but it sure looks broken to my eyes too. I am tired at the moment, though, so may be missing something. Gnulib has an fnmatch, which might be worth considering for use; but AIUI it suffers from the same overly-locale-aware problem that system fnmatches can suffer from (fnmatch fails when the string isn't encoded properly for the current locale; we often don't even _know_ the original encoding, especially for FTP, and mainly want * to match any arbitrary string of byte values). They were looking for someone to address that issue: http://lists.gnu.org/archive/html/bug-gnulib/2008-02/msg00019.html Perhaps, if I'm motivated and somehow scrounge the time, I can fix the problem in their code, and then use it in ours? :) Or, if someone else with more time would like to tackle it, I'm sure that'd also be welcome. :) I responded to the message linked above with a note that Wget also had a need for such functionality, along with some questions about the approach, but hadn't received a response. Maybe I'll try again.

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/
Re: wget 1.11.1 make test fails
Hello Micah, On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote: could you try to isolate which part of test_dir_matches_p is failing? The only failing src/utils.c test_array[] line is: | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false }, I don't understand enough of dir_matches_p() and fnmatch() to guess what is supposed to happen. But with false replaced by true, this test and the following ones succeed. | ALL TESTS PASSED | Tests run: 7 Of course this test then fails on newer systems. Alain.
Re: wget 1.11.1 make test fails
Alain Guibert [EMAIL PROTECTED] writes: Hello Micah, On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote: could you try to isolate which part of test_dir_matches_p is failing? The only failing src/utils.c test_array[] line is: | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false }, I don't understand enough of dir_matches_p() and fnmatch() to guess what is supposed to happen. But with false replaced by true, this test and the following ones succeed. '*' is not supposed to match '/' in regular fnmatch. It sounds like a libc problem rather than a gcc problem. Try #undef-ing SYSTEM_FNMATCH in sysdep.h and see if it works then.
Re: wget 1.11.1 make test fails
Hrvoje Niksic wrote: Alain Guibert [EMAIL PROTECTED] writes: Hello Micah, On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote: could you try to isolate which part of test_dir_matches_p is failing? The only failing src/utils.c test_array[] line is: | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false }, I don't understand enough of dir_matches_p() and fnmatch() to guess what is supposed to happen. But with false replaced by true, this test and the following ones succeed. '*' is not supposed to match '/' in regular fnmatch.

Well, that's assuming you pass it the FNM_PATHNAME flag (which, for dir_matches_p, we always do). It sounds like a libc problem rather than a gcc problem. Try #undef-ing SYSTEM_FNMATCH in sysdep.h and see if it works then. It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I mean, don't most shells rely on this to handle file globbing and whatnot?

-- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/
Re: wget fails using proxy with https-protocol
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Micah Cowan wrote: Julien, I've CC'd you, in case you think this might be something you'd want to add to your GSoC proposal. If it _is_, it's probably something that should be done before the rest, so I can backport it into the 1.11 branch for a 1.11.2 release (since this is an important regression), rather than make people wait for 1.12 to come out (which is where I expect the rest of the authorization improvements would go). Er, on reflection, that's a terrible idea, given that coding for GSoC doesn't even start until nearly June, and this is a serious regression that should be fixed as soon as it can be got to. Still, if you'd like to tackle it out-of-band, that'd be handy. :) - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH87mY7M8hyUobTrERAuyjAJ0XJ8ImAFZ/J49EGQlc+HWWNdxhQACgiK3U bgyhQErH//V6bDkaeE9mLYM= =3fn1 -END PGP SIGNATURE-
Re: wget 1.11.1 make test fails
Micah Cowan [EMAIL PROTECTED] writes: It sounds like a libc problem rather than a gcc problem. Try #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then. It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I mean, don't most shells rely on this to handle file globbing and whatnot? The conventional wisdom among free software of the 90s was that fnmatch() was too buggy to be useful. For that reason all free shells rolled their own fnmatch, as did other programs that needed it, including Wget. Maybe the conventional wisdom was right for the reporter's system. Another possibility is that something else is installing fnmatch.h in a directory on the compiler's search path and breaking the system fnmatch. IIRC Apache was a known culprit that installed fnmatch.h in /usr/local/include. That was another reason why Wget used to completely ignore system-provided fnmatch. In any case, it should be easy enough to isolate the problem:

    #include <stdio.h>
    #include <fnmatch.h>

    int main()
    {
      printf("%d\n", fnmatch("foo*", "foo/bar", FNM_PATHNAME));
      return 0;
    }

It should print a non-zero value.
Re: wget 1.11.1 make test fails
Micah Cowan [EMAIL PROTECTED] writes: I'm wondering whether it might make sense to go back to completely ignoring the system-provided fnmatch? One argument against that approach is that it increases code size on systems that do correctly implement fnmatch, i.e. on most modern Unixes that we are targeting. Supporting I18N file names would require modifications to our fnmatch; but on the other hand, we still need it for Windows, so we'd have to make those changes anyway. Providing added value in our fnmatch implementation should go a long way towards preventing complaints of code bloat. In particular, it would probably resolve the remaining issue with that one bug you reported about fnmatch() failing on strings whose encoding didn't match the locale. It would. Additionally, I've been toying with the idea of adding something like a ** to match all characters, including slashes. That would be great. That kind of thing is known to zsh users anyway, and it's a useful feature.
Re: Wget 1.11 build fails on old Linux
On Monday, February 25, 2008 at 16:32:21 +0100, Alain Guibert wrote: On an old Debian Bo system (kernel 2.0.40, gcc 2.7.2.1, libc 5.4.33), building Wget 1.11 fails: While wget 1.11.1 builds and works OK. Thank you very much, gentlemen! Alain.
Re: wget 1.11.1 make test fails
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Alain Guibert wrote: Hello, With an old gcc 2.7.2.1 compiler, wget 1.11.1 make test fails: | gcc -I. -I. -I./../src -DHAVE_CONFIG_H -DSYSTEM_WGETRC=\"/usr/local/etc/wgetrc\" -DLOCALEDIR=\"/usr/local/share/locale\" -O2 -Wall -DTESTING -c ../src/test.c | ../src/test.c: In function `all_tests': | ../src/test.c:51: parse error before `const' snip The attached make-test.patch seems to fix this. Yeah; that's invalid C90 code; declaration following statement. I'll fix that. However later the 3rd test fails: | ./unit-tests | RUNNING TEST test_parse_content_disposition... | PASSED | | RUNNING TEST test_subdir_p... | PASSED | | RUNNING TEST test_dir_matches_p... | test_dir_matches_p: wrong result | Tests run: 3 | make[1]: *** [run-unit-tests] Error 1 | make[1]: Leaving directory `/tmp/wget-1.11.1/tests' | make: *** [test] Error 2 That's an interesting failure. I wonder if it's one of the new cases I just added... In any case, it runs through fine for me. This suggests a difference in behavior between your system fnmatch function and mine (since that should be the only bit of external code that dir_matches_p relies on). Pity the tests don't give much clue as to the specifics of what failed... there are about 10 tests for test_dir_matches_p, any of which could have caused the problem. The whole testing thing needs some serious rework; which is my current top priority, when I find time for it (GSoC is eating everything, right now). make test isn't actually expected to work completely, right now; some of the .px tests are known to be broken/missing. They're basically provided as-is. I thought about removing them for the official package; maybe I should have. But if I had, I'd still be blissfully unaware of this potential problem. If you know how, and don't mind, could you try to isolate which part of test_dir_matches_p is failing? Perhaps augmenting the error message to spit out the match-list and string arguments... - -- Thanks, Micah J.
Cowan Programmer, musician, typesetting enthusiast, gamer, and GNU Wget Project Maintainer. http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH8S/v7M8hyUobTrERAhrPAJ9N+XqLeVP0NN9HkLxO162Zf2uJnACeMwUo kew/FkMA2GljqWiPG6IC+zs= =fQSH -END PGP SIGNATURE-
Re: wget aborts when file exists
(I am not subscribed to the bug-list) ** On 12.03.2008 at 20:36 Charles wrote: On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar Radulovic [EMAIL PROTECTED] wrote: (I am not subscribed to the bug-list) Hello, I use wget to retrieve recursively images from a site, which are randomly changed on a daily basis. I wrote a small batch which worked until a system upgrade. Now the new version of wget is installed, but it aborts when any file already exists. When I tried this in my wget, I got different behavior with wget 1.11 alpha and wget 1.10.2 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. File `localhost/test/a.gif' already there; not retrieving. File `localhost/test/b.gif' already there; not retrieving. File `localhost/test/c.jpg' already there; not retrieving. FINISHED --20:31:41-- Downloaded: 0 bytes in 0 files I think wget 1.10.2 behavior is more correct. Anyway it did not abort in my case. --- Charles It breaks, and it is wget 1.10.2. I really don't know why, and I have no influence over that because I am not an administrator of the system, just a user. However, it seems that this bug occurs. Aca
Re: wget aborts when file exists
Charles [EMAIL PROTECTED] writes: On Thu, Mar 13, 2008 at 1:17 AM, Hrvoje Niksic [EMAIL PROTECTED] wrote: It assumes, though, that the preexisting index.html corresponds to the one that you were trying to download; it's unclear to me how wise that is. That's what -nc does. But the question is why it assumes that dependent files are also present. Because I repeated the command, and the files have all been downloaded before. We know that, but Wget 1.11 doesn't seem to check it. It only checks index.html, but not the other dependent files.
Re: wget aborts when file exists
On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar Radulovic [EMAIL PROTECTED] wrote: (I am not subscribed to the bug-list) Hello, I use wget to retrieve recursively images from a site, which are randomly changed on a daily basis. I wrote a small batch which worked until a system upgrade. Now the new version of wget is installed, but it aborts when any file already exists. When I tried this in my wget, I got different behavior with wget 1.11 alpha and wget 1.10.2 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. File `localhost/test/a.gif' already there; not retrieving. File `localhost/test/b.gif' already there; not retrieving. File `localhost/test/c.jpg' already there; not retrieving. FINISHED --20:31:41-- Downloaded: 0 bytes in 0 files I think wget 1.10.2 behavior is more correct. Anyway it did not abort in my case. --- Charles
Re: wget aborts when file exists
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Aleksandar Radulovic wrote: (I am not subscribed to the bug-list) Hello, I use wget to retrieve recursively images from a site, which are randomly changed on a daily basis. I wrote a small batch which worked until a system upgrade. Now the new version of wget is installed, but it aborts when any file already exists. I call it in the following way (I hid the URL and proxies): wget -np -nc -r -l 1 URL I strongly suspect this issue does not appear in the current Wget release (version 1.11). Please obtain a copy of that (or better yet, the prerelease version at http://alpha.gnu.org/gnu/wget/, which fixes some other issues), and see if you can reproduce it. - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH1/5A7M8hyUobTrERAjmVAKCJ6P7eunjPaptm80rFc9si7lejWgCfQHoL xTd0dLxEr2odzHcurg+5LqQ= =VpCG -END PGP SIGNATURE-
Re: wget aborts when file exists
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Charles wrote: On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar Radulovic [EMAIL PROTECTED] wrote: (I am not subscribed to the bug-list) Hello, I use wget to retrieve recursively images from a site, which are randomly changed on a daily basis. I wrote a small batch which worked until a system upgrade. Now the new version of wget is installed, but it aborts when any file already exists. When I tried this in my wget, I got different behavior with wget 1.11 alpha and wget 1.10.2 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. File `localhost/test/a.gif' already there; not retrieving. File `localhost/test/b.gif' already there; not retrieving. File `localhost/test/c.jpg' already there; not retrieving. FINISHED --20:31:41-- Downloaded: 0 bytes in 0 files I think wget 1.10.2 behavior is more correct. Anyway it did not abort in my case. I think I like the 1.11 behavior (I'm assuming it's intentional). It assumes, though, that the preexisting index.html corresponds to the one that you were trying to download; it's unclear to me how wise that is. Hrvoje, are you aware of this change and its rationale? - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH1/7B7M8hyUobTrERAgY+AJ4+uSRpDzUnmgiaWSanFGsFET/BRACfcGnT eRcfOIAHhDvibRn0/EQiAB4= =GH8N -END PGP SIGNATURE-
Re: wget aborts when file exists
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Charles wrote: On Wed, Mar 12, 2008 at 11:03 PM, Micah Cowan [EMAIL PROTECTED] wrote: I think I like the 1.11 behavior (I'm assuming it's intentional). It assumes, though, that the preexisting index.html corresponds to the one that you were trying to download; it's unclear to me how wise that is. Hrvoje, are you aware of this change and its rationale? Hi, One drawback of this behavior is that when we mirror a website and then cancel it, but the server does not provide a Last-Modified header (because the content is dynamically generated, for example), we cannot continue from the point where we canceled the download (all the files will have to be downloaded again). If you didn't want that, you probably shouldn't specify -nc, and should instead specify -c. - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH2A8x7M8hyUobTrERAvqZAJ9JGvX60DJBheqB/BjiEQh9KIRpPgCbBccX bD/mUv5ee+dRxFXPBZtGE+o= =7fvu -END PGP SIGNATURE-
Re: wget aborts when file exists
Micah Cowan [EMAIL PROTECTED] writes: When I tried this in my wget, I got different behavior with wget 1.11 alpha and wget 1.10.2 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. File `localhost/test/a.gif' already there; not retrieving. File `localhost/test/b.gif' already there; not retrieving. File `localhost/test/c.jpg' already there; not retrieving. FINISHED --20:31:41-- Downloaded: 0 bytes in 0 files I think wget 1.10.2 behavior is more correct. Anyway it did not abort in my case. I think I like the 1.11 behavior (I'm assuming it's intentional). Let me recap to see if I understand the difference. From the above output, it seems that 1.10's -r descended into an HTML file even if it was already downloaded. 1.11's -r assumes that if an HTML file is already there, then so are all the other files it references. If this analysis is correct, I don't see the benefit of the new behavior. If index.html happens to be present, it doesn't mean that the files it references are also present. I don't know if the change was intentional, but it looks incorrect to me. It assumes, though, that the preexisting index.html corresponds to the one that you were trying to download; it's unclear to me how wise that is. That's what -nc does. But the question is why it assumes that dependent files are also present.
Re: wget aborts when file exists
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hrvoje Niksic wrote: Micah Cowan [EMAIL PROTECTED] writes: When I tried this in my wget, I got different behavior with wget 1.11 alpha and wget 1.10.2 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/ File `localhost/test/index.html' already there; not retrieving. File `localhost/test/a.gif' already there; not retrieving. File `localhost/test/b.gif' already there; not retrieving. File `localhost/test/c.jpg' already there; not retrieving. FINISHED --20:31:41-- Downloaded: 0 bytes in 0 files I think wget 1.10.2 behavior is more correct. Anyway it did not abort in my case. I think I like the 1.11 behavior (I'm assuming it's intentional). Let me recap to see if I understand the difference. From the above output, it seems that 1.10's -r descended into an HTML even if it was downloaded. 1.11's -r assumes that if an HTML file is already there, then so are all the other files it references. If this analysis is correct, I don't see the benefit of the new behavior. If index.html happens to be present, it doesn't mean that the files it references are also present. I don't know if the change was intentional, but it looks incorrect to me. Oh. Um, yeah, I think I had it swapped. I was thinking the first example was 1.10.2, and the second 1.11, but judging by the names I'm thinking you're right. In that case, it looks to me like a regression. Thanks, Hrvoje. - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFH2CGX7M8hyUobTrERAgQ3AJ4hNg/ujDOwhHHUuFPj0WnrnVPDWACgidpw wNx435+A5Gjt4tr2LHxFzqo= =CydB -END PGP SIGNATURE-
Re: wget aborts when file exists
On Thu, Mar 13, 2008 at 1:17 AM, Hrvoje Niksic [EMAIL PROTECTED] wrote: It assumes, though, that the preexisting index.html corresponds to the one that you were trying to download; it's unclear to me how wise that is. That's what -nc does. But the question is why it assumes that dependent files are also present. Because I repeated the command, and the files have all been downloaded before. By the way, the index.html contains a link to the three images. I was trying what Aleksandar Radulovic was reporting.
Re: Wget continue option and buggy webserver
From: Charles In wget 1.10, [...] Have you tried this in something like a current release (1.11, or even 1.10.2)? http://ftp.gnu.org/gnu/wget/ [...] but for some reason (buggy server), [...] How should wget know that it's getting a bogus error from your buggy server, and not getting a valid error from a working server? Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
Re: Wget continue option and buggy webserver
On Feb 19, 2008 11:25 PM, Steven M. Schweda [EMAIL PROTECTED] wrote: From: Charles In wget 1.10, [...] Have you tried this in something like a current release (1.11, or even 1.10.2)? My wget version is 1.10.2. It isn't really a problem for me; I just want to know if this is a known problem or, if it is not, whether it could be considered a bug/enhancement. http://ftp.gnu.org/gnu/wget/ [...] but for some reason (buggy server), [...] How should wget know that it's getting a bogus error from your buggy server, and not getting a valid error from a working server? The problem is that the server does not give an error. A normal web server like Apache gives a "requested range not satisfiable" error if we request the range 1- for a file whose size is 1, but this webserver gives HTTP 200 OK, and wget will happily redownload the file. I think in this case, if the -c switch is given on the command line and the web server returns HTTP 200 OK with a Content-Length header of n bytes and there is already a file of n bytes on the disk, then wget should not redownload the file. I know the problem is with the webserver and not on wget's side, but sometimes we're dealing with a buggy webserver over which we have no control.
Re: Wget continue option and buggy webserver
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Charles wrote: Hi, In wget 1.10, I observe this behavior: If I execute this command wget -c somefile where somefile is already fully retrieved, but for some reason (buggy server) the webserver returns a 400 OK HTTP message instead of "requested range not satisfiable", then wget will redownload the file, sometimes producing somefile.1. So, has there been any change to this behavior, or can this be filed as a bug/enhancement? I'm not sure I understand what it is that you wish Wget to do? (And, I'm assuming you meant 200 OK?) We could have Wget treat 200 OK exactly as 416 Requested Range Not Satisfiable; but then it won't properly handle servers that legitimately do not support byte ranges for some or all files. Wget has no way of knowing that the file had been completely downloaded. It may have been interrupted, or (as mentioned) byte ranges may not be available. It has little choice but to redownload. I suppose at some point in the future, when we have Wget saving download session data to a specified file, it could record that the file was successfully retrieved (which is only possible, of course, for files whose length is known by the server, or which were sent using the chunked Transfer-Encoding), and in that special case treat 200 OK as 416. But we're a long way from recording session data, and it would need further discussion to determine how appropriate this behavior would be. - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHuyo27M8hyUobTrERAi/fAJ41xi81cXgC815paf3vlLVPfcvQ/QCdGd0A VQwy2gIuMgRMgYHQsQJB/28= =vHsn -END PGP SIGNATURE-
Re: Wget continue option and buggy webserver
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Charles wrote: The problem is that the server does not give error. Normal web server like apache gives request range unsatisfied error if we try to request range 1- for a file which size is 1 but this webserver give HTTP 200 OK where wget will happily redownload the file. I think in this case, if the -c switch is given on the command line and the web server returns HTTP 200 OK with content-length header of n bytes and there is already a file with n bytes in the disk, then wget should not redownload the file. I know the problem is with the webserver and not in the side of wget, but sometimes we're dealing with buggy webserver which we don't have control on it. Hm... that _might_ be okay. Anyone else have thoughts on this? - -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHuyvo7M8hyUobTrERAlgAAJwPzwIku5JUyFzHQRURRf+x61m4pQCfRPH7 CPid8ut3NMLv5wIFgRLheCg= =aabn -END PGP SIGNATURE-