Re: Content disposition question
Micah Cowan <[EMAIL PROTECTED]> writes:

>> I thought the code was refactored to determine the file name after
>> the headers arrive.  It certainly looks that way by the output it
>> prints:
>>
>> {mulj}[~]$ wget www.cnn.com
>> [...]
>> HTTP request sent, awaiting response... 200 OK
>> Length: unspecified [text/html]
>> Saving to: `index.html'  # note "saving to" only after the HTTP response
>>
>> Where does the extra traffic come from?
>
> Your example above doesn't set --content-disposition;

I'm aware of that, but the above example was supposed to point out the
refactoring that has already taken place, regardless of whether
--content-disposition is specified.  As shown above, Wget always waits
for the headers before determining the file name.  If that is the
case, it would appear that no additional traffic is needed to get
Content-Disposition; Wget simply needs to use the information it has
already received.

> As to why this is the case, I believe it was so that we could
> properly handle accepts/rejects,

Issuing another request seems to be the wrong way to go about it, but
I haven't thought about it hard enough, so I could be missing a lot of
subtleties.

>> I am aware that the NEWS entry claims that the feature is experimental,
>> but why even mention it if it's not ready for general consumption?
>> Announcing experimental features in NEWS is a good way to make testers
>> aware of them during the alpha/beta release cycle, but it should be
>> avoided in production releases of mature software.
>
> It's pretty much "good enough"; it's not where I want it, but it
> _is_ usable. The extra traffic is really the main reason I don't
> want it on-by-default.

It should IMHO be documented, then.  Even if it's documented as
experimental.
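A minimal Python sketch of the point argued above: if the file name is chosen only after the response headers arrive, that same response can supply Content-Disposition without a second request. This is an illustration, not Wget's code. `filename_from_response` is a hypothetical helper, and the thread does not show the raw header the server sent, so any header value passed in is an assumption.

```python
from email.message import Message
from urllib.parse import urlsplit, unquote
import posixpath


def filename_from_response(url, headers):
    """Pick a local file name from one response's headers (hypothetical helper).

    Falls back to the last component of the URL path, then to "index.html".
    """
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    # get_filename() reads the filename parameter of Content-Disposition
    # (and, failing that, the name parameter of Content-Type).
    name = msg.get_filename()
    if name:
        # Keep only the last path component so the server cannot choose a directory.
        name = posixpath.basename(name.replace("\\", "/"))
    if not name:
        name = posixpath.basename(unquote(urlsplit(url).path)) or "index.html"
    return name
```

Called with the headers of the GET response that is already in hand, the helper needs no extra HEAD: with no Content-Disposition header it falls back to the URL, which matches the `946879` and `index.html` names seen in the transcripts in this thread.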
Re: Content disposition question
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> Actually, the reason it is not enabled by default is that (1) it is
>> broken in some respects that need addressing, and (2) as it is currently
>> implemented, it involves a significant amount of extra traffic,
>> regardless of whether the remote end actually ends up using
>> Content-Disposition somewhere.
>
> I'm curious, why is this the case?  I thought the code was refactored
> to determine the file name after the headers arrive.  It certainly
> looks that way by the output it prints:
>
> {mulj}[~]$ wget www.cnn.com
> [...]
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: `index.html'  # note "saving to" only after the HTTP response
>
> Where does the extra traffic come from?

Your example above doesn't set --content-disposition; if you do, there
is an extra HEAD request sent.

As to why this is the case, I believe it was so that we could properly
handle accepts/rejects, whereas we will otherwise usually assume that
we can match accept/reject against the URL itself (we currently do this
improperly for the "-nd -r" case, still matching using the generated
file name's suffix).

Beyond that, I'm not sure as to why, and it's my intention that it not
be done in 1.12. Removing it for 1.11 is too much trouble, as the
sending-HEAD and sending-GET code paths are not nearly decoupled enough
to do it without risk (and indeed, we were seeing trouble where every
time we "fixed" a problem with the send-HEAD-first logic, something else
would break). I want to do some reworking of gethttp and http_loop
before I will feel comfortable in changing how they work.

> If it is not ready for general use, we should consider removing it
> from NEWS.

I had thought of that.
The thing that has kept me from it so far is that it is a feature
that is desired by many people, and for most of them, it will work (the
issues are pretty minor, and mainly corner-case, except perhaps for the
fact that the files are apparently always downloaded to the top
directory, and not the one in which the URL was found).

And, if we leave it out of NEWS and documentation, then, when we answer
people who ask "How can I get Wget to respect Content-Disposition
headers?", the natural follow-up will be, "Why isn't this mentioned
anywhere in the documentation?". :)

> If not, it should be properly documented in the manual.

Yes... I should be more specific about its shortcomings.

> I am aware that the NEWS entry claims that the feature is experimental,
> but why even mention it if it's not ready for general consumption?
> Announcing experimental features in NEWS is a good way to make testers
> aware of them during the alpha/beta release cycle, but it should be
> avoided in production releases of mature software.

It's pretty much "good enough"; it's not where I want it, but it _is_
usable. The extra traffic is really the main reason I don't want it
on-by-default.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
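The accept/reject concern raised in this message can be illustrated with a simplified Python model. It is not Wget's matching logic: `accepted` is a hypothetical helper, the `*.torrent` pattern is an assumption chosen for illustration, and the header-supplied name is copied from the transcript in the original report. It only shows why a pattern checked against the URL can disagree with the same pattern checked against the name the server supplies.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit


def accepted(name, accept_patterns):
    """Return True if the name matches any shell-style accept pattern (hypothetical helper)."""
    return any(fnmatchcase(name, pattern) for pattern in accept_patterns)


# URL and server-supplied name taken from the transcript in the original report.
url = "http://www.mininova.org/get/946879"
header_name = "-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent"

# Name derived from the URL alone, as when no Content-Disposition is consulted.
url_name = basename(urlsplit(url).path)

patterns = ["*.torrent"]  # assumed accept list, for illustration only
url_decision = accepted(url_name, patterns)        # decision available before any request
header_decision = accepted(header_name, patterns)  # decision needs the response headers
```

In this model the two decisions differ, so a crawler that wants to filter on the final name has to see the headers first; whether that requires a separate HEAD or can be done on the GET response is the question debated in the thread.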
Re: Content disposition question
Micah Cowan <[EMAIL PROTECTED]> writes:

> Actually, the reason it is not enabled by default is that (1) it is
> broken in some respects that need addressing, and (2) as it is currently
> implemented, it involves a significant amount of extra traffic,
> regardless of whether the remote end actually ends up using
> Content-Disposition somewhere.

I'm curious, why is this the case?  I thought the code was refactored
to determine the file name after the headers arrive.  It certainly
looks that way by the output it prints:

{mulj}[~]$ wget www.cnn.com
[...]
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'  # note "saving to" only after the HTTP response

Where does the extra traffic come from?

> Note that it is not available at all in any release version of Wget;
> only in the current development versions. We will be releasing Wget 1.11
> very shortly, which will include the --content-disposition
> functionality; however, this functionality is EXPERIMENTAL only. It
> doesn't quite behave properly, and needs some severe adjustments before
> it is appropriate to leave as default.

If it is not ready for general use, we should consider removing it
from NEWS.  If not, it should be properly documented in the manual.

I am aware that the NEWS entry claims that the feature is experimental,
but why even mention it if it's not ready for general consumption?
Announcing experimental features in NEWS is a good way to make testers
aware of them during the alpha/beta release cycle, but it should be
avoided in production releases of mature software.

> As to breaking old scripts, I'm not really concerned about that (and
> people who read the NEWS file, as anyone relying on previous
> behaviors for Wget should do, would just need to set
> --no-content-disposition, when the time comes that we enable it by
> default).

Agreed.
Re: Content disposition question
Matthias Vill wrote:
> Hi,
>
> we know this. This was just recently discussed on the mailing list and I
> agree with you.
> But there are two arguments why this is not the default:
> a) It's a quite new feature for wget and therefore would break
> compatibility with prior versions and any "old" script would need to be
> rewritten.
> b) It's impossible to pre-guess the filename and thus it is not so well
> suited for script usage.
>
> I would like to have this feature enabled by some "--interactive" switch
> (which could include more options and might be easier to find) or, as you
> suggested, as the default with a disable switch.

Actually, the reason it is not enabled by default is that (1) it is
broken in some respects that need addressing, and (2) as it is currently
implemented, it involves a significant amount of extra traffic,
regardless of whether the remote end actually ends up using
Content-Disposition somewhere.

Note that it is not available at all in any release version of Wget;
only in the current development versions. We will be releasing Wget 1.11
very shortly, which will include the --content-disposition
functionality; however, this functionality is EXPERIMENTAL only. It
doesn't quite behave properly, and needs some severe adjustments before
it is appropriate to leave as default.

As to breaking old scripts, I'm not really concerned about that (and
people who read the NEWS file, as anyone relying on previous behaviors
for Wget should do, would just need to set --no-content-disposition,
when the time comes that we enable it by default).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Content disposition question
Hi,

we know this. This was just recently discussed on the mailing list and I
agree with you.
But there are two arguments why this is not the default:
a) It's a quite new feature for wget and therefore would break
compatibility with prior versions and any "old" script would need to be
rewritten.
b) It's impossible to pre-guess the filename and thus it is not so well
suited for script usage.

I would like to have this feature enabled by some "--interactive" switch
(which could include more options and might be easier to find) or, as you
suggested, as the default with a disable switch.

Greetings
Matthias

Vladimir Niksic wrote:
> I have noticed that wget doesn't automatically use the option
> '--content-disposition'. So what happens is when you download something
> from a site that uses content disposition, the resulting file on the
> filesystem is not what it should be.
> I realize that I could put this option in .wgetrc, but I think that it
> would be better if this was the default because the majority of users are
> unaware of this option, and cannot hope to find it unless acquainted
> with the inner mechanics of HTTP. Also, it's nearly impossible to find.
> I've been googling it and finally managed to dig it up from the
> documentation.
Content disposition question
Hi!

I have noticed that wget doesn't automatically use the option
'--content-disposition'. So what happens is when you download something
from a site that uses content disposition, the resulting file on the
filesystem is not what it should be.

For example, when downloading an Ubuntu torrent from mininova I get:

{uragan}[~/tmp]$ wget http://www.mininova.org/get/946879
--2007-12-03 15:58:46--  http://www.mininova.org/get/946879
Resolving www.mininova.org... 87.233.147.140
Connecting to www.mininova.org|87.233.147.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28064 (27K) [application/x-bittorrent]
Saving to: `946879'

100%[>] 28,064      87.0K/s   in 0.3s

2007-12-03 15:58:47 (87.0 KB/s) - `946879' saved [28064/28064]

When I use the option --content-disposition:

{uragan}[~/tmp]$ wget --content-disposition http://www.mininova.org/get/946879
--2007-12-03 15:59:18--  http://www.mininova.org/get/946879
Resolving www.mininova.org... 87.233.147.140
Connecting to www.mininova.org|87.233.147.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [application/x-bittorrent]
--2007-12-03 15:59:18--  http://www.mininova.org/get/946879
Connecting to www.mininova.org|87.233.147.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28064 (27K) [application/x-bittorrent]
Saving to: `-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent'

100%[>] 28,064      47.8K/s   in 0.6s

2007-12-03 15:59:19 (47.8 KB/s) - `-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent' saved [28064/28064]

I realize that I could put this option in .wgetrc, but I think that it
would be better if this was the default because the majority of users are
unaware of this option, and cannot hope to find it unless acquainted
with the inner mechanics of HTTP. Also, it's nearly impossible to find.
I've been googling it and finally managed to dig it up from the
documentation.
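The second transcript shows two requests for a single download; the replies in this thread identify the first one as a HEAD. The Python sketch below models that traffic difference with a hypothetical in-memory `fake_server`. Every name in it is an assumption made for illustration, including the header value, and it does not reflect how Wget is actually implemented.

```python
def fake_server(method, url, log):
    """Hypothetical stand-in for a web server: records the method, returns (headers, body)."""
    log.append(method)
    headers = {"Content-Disposition": 'attachment; filename="example.torrent"'}
    # A HEAD response carries headers only; a GET response also carries the body.
    body = b"" if method == "HEAD" else b"payload"
    return headers, body


def fetch_head_first(url, log):
    """Model of the behaviour described in the thread: a HEAD, then the GET."""
    fake_server("HEAD", url, log)
    return fake_server("GET", url, log)


def fetch_single_get(url, log):
    """Alternative suggested in the thread: one GET, name chosen after its headers arrive."""
    return fake_server("GET", url, log)


head_first_log, single_get_log = [], []
fetch_head_first("http://example.invalid/get/1", head_first_log)
fetch_single_get("http://example.invalid/get/1", single_get_log)
```

Both strategies see the same Content-Disposition header before writing any data; the head-first variant simply issues one more request per URL, which is the extra traffic discussed in the replies.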