Hello,

Recently I had a problem with wget. An application of mine has data spread over several HTTP/1.1 servers that know about each other. A client can query any of the servers; if that server doesn't have the information, it knows which other server does and returns an HTTP redirect to the URL on that server. Some of the queries use POST to specify which data they want, and those are also subject to HTTP redirects, in which case the POST should be repeated on the next server.

Usually, after an HTTP redirect the client repeats the query with a GET. That is the Post/Redirect/Get pattern (http://en.wikipedia.org/wiki/Post/Redirect/Get), used by web forms to send the user to another web page instead of generating the HTML content on the submit URL. To resolve the ambiguity over which method the client should use after a redirect, HTTP/1.1 introduced status code 307, which indicates that the server expects the client to try the next URL using the same method (see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#3xx_Redirection; RFC2616 is the one that defines this new code, but unfortunately I found it to be not very explicit about this behaviour). So the redirects I was referring to above are all implemented as 307 redirects.

When using an HTML form in Firefox this works just fine, but when I tried to automate it I noticed that wget does not repeat the POST after a 307 redirect. I tried curl and saw that it handles 307 redirects correctly, so for now I have had to resort to curl in my scripts, which is not ideal since wget is my tool of choice...

So I decided to fix the issue in wget, to make it behave like both Firefox and curl, and to respect the "spirit" (if not the "letter") of the RFC.

Attached to this e-mail you will find a patch created against the latest (at this time) wget 1.12-2443 from the bzr repository. Also attached is a tarball with some files to help test the issue.

The wget307test.tgz file should be unpacked directly under /var/www (or whatever the Apache root htdocs directory is). There's an .htaccess that sets up everything that needs setting up, as long as "AllowOverride all" is set for that directory in the main Apache config file. actual.cgi is a Perl CGI that receives the redirected requests (the redirect from /wget307test/redirect.cgi with code 307 is also implemented inside .htaccess) and tests whether it worked or not.
testform.html is a form that can be used to test, from a web browser such as Firefox, a submit to a URL that returns a 307 redirect. testcurl.sh is a shell script that does the test using curl (I tested it with curl 7.19.7 and it works). testwget.sh is a shell script that does the test using wget (I tested it with both vanilla 1.12 and unpatched 1.12-2443 from bzr, and it does not work with either).

The output of the CGI (which is what each test displays) is textual. It prints a line indicating whether the test worked, based on a submitted parameter (which is lost if the POST was translated to a GET, as in wget's case). It also prints another submitted variable (a little sanity check for the CGI) and which method (GET or POST) was used for the request to the CGI.

I also updated the documentation (wget.texi, which is used to generate all the others, including the man page) and the ChangeLog, but I may have forgotten something. Feel free to change anything in the patch that you feel could be done better.

I hope this helps, and I really hope to see this fix included in the next official release of wget! :-D

Keep up the great work building this awesome web client tool!

Cheers,
Filipe

=== modified file 'ChangeLog'
--- ChangeLog 2010-10-24 19:45:30 +0000
+++ ChangeLog 2010-11-20 05:58:09 +0000
@@ -1,3 +1,7 @@
+2010-11-20 Filipe Brandenburger <[email protected]>
+
+ * Respect HTTP/1.1 307 redirect code, by preserving same request method (POST).
+
 2010-10-24 Jessica McKellar <[email protected]> (tiny change)
 
  * NEWS: Mention the change to the the summary for recursive downloads.

=== modified file 'NEWS'
--- NEWS 2010-11-19 17:26:14 +0000
+++ NEWS 2010-11-20 06:05:56 +0000
@@ -24,6 +24,8 @@
 
 ** Print diagnostic messages to stderr, not stdout.
 
+** Support HTTP/1.1 307 redirects keep request method.
+
 ** Do not use an additional HEAD request when --content-disposition is used,
 but use directly GET.
 

=== modified file 'doc/wget.texi'
--- doc/wget.texi 2010-10-28 22:20:31 +0000
+++ doc/wget.texi 2010-11-20 06:04:04 +0000
@@ -1467,12 +1467,12 @@
 can't know that until it receives a response, which in turn requires the
 request to have been completed -- a chicken-and-egg problem.
 
-Note: if Wget is redirected after the POST request is completed, it
-will not send the POST data to the redirected URL. This is because
-URLs that process POST often respond with a redirection to a regular
-page, which does not desire or accept POST. It is not completely
-clear that this behavior is optimal; if it doesn't work out, it might
-be changed in the future.
+Note: if Wget is redirected with an HTTP status code other than 307
+after the POST request is completed, it will not send the POST data
+to the redirected URL. This is because URLs that process POST often
+respond with a redirection to a regular page, which does not desire
+or accept POST. To explicitely request a POST after a redirect, an
+HTTP/1.1 compliant server should return a 307 redirect status code.
 
 This example shows how to log to a server using POST and then proceed to
 download the desired pages, presumably only accessible to authorized

=== modified file 'src/http.c'
--- src/http.c 2010-11-19 16:14:21 +0000
+++ src/http.c 2010-11-20 05:56:31 +0000
@@ -2319,6 +2319,15 @@
               CLOSE_INVALIDATE (sock);
           xfree_null (type);
           xfree (head);
+          /* From RFC2616: The status codes 303 and 307 have
+             been added for servers that wish to make unambiguously
+             clear which kind of reaction is expected of the client.
+
+             A 307 should be redirected using the same method,
+             in other words, a POST should be preserved and not
+             converted to a GET in that case. */
+          if (statcode == HTTP_STATUS_TEMPORARY_REDIRECT)
+            return NEWLOCATION_KEEP_POST;
           return NEWLOCATION;
         }
     }
@@ -2798,6 +2807,7 @@
           ret = err;
           goto exit;
         case NEWLOCATION:
+        case NEWLOCATION_KEEP_POST:
           /* Return the new location to the caller. */
           if (!*newloc)
             {
@@ -2808,7 +2818,7 @@
             }
           else
             {
-              ret = NEWLOCATION;
+              ret = err;
             }
           goto exit;
         case RETRUNNEEDED:

=== modified file 'src/retr.c'
--- src/retr.c 2010-10-21 11:27:31 +0000
+++ src/retr.c 2010-11-20 05:50:57 +0000
@@ -763,7 +763,7 @@
       proxy_url = NULL;
     }
 
-  location_changed = (result == NEWLOCATION);
+  location_changed = (result == NEWLOCATION || result == NEWLOCATION_KEEP_POST);
   if (location_changed)
     {
       char *construced_newloc;
@@ -837,12 +837,17 @@
         }
       u = newloc_parsed;
 
-      /* If we're being redirected from POST, we don't want to POST
+      /* If we're being redirected from POST, and we received a
+         redirect code different than 307, we don't want to POST
          again. Many requests answer POST with a redirection to an
          index page; that redirection is clearly a GET. We "suspend"
          POST data for the duration of the redirections, and restore
-         it when we're done. */
-      if (!post_data_suspended)
+         it when we're done.
+
+         RFC2616 HTTP/1.1 introduces code 307 Temporary Redirect
+         specifically to preserve the method of the request.
+      */
+      if (result != NEWLOCATION_KEEP_POST && !post_data_suspended)
         SUSPEND_POST_DATA;
 
       goto redirected;

=== modified file 'src/wget.h'
--- src/wget.h 2010-09-29 11:34:09 +0000
+++ src/wget.h 2010-11-20 03:46:19 +0000
@@ -352,7 +352,7 @@
   PROXERR,
   /* 50 */
   AUTHFAILED, QUOTEXC, WRITEFAILED, SSLINITFAILED, VERIFCERTERR,
-  UNLINKERR
+  UNLINKERR, NEWLOCATION_KEEP_POST
 } uerr_t;
 
 /* 2005-02-19 SMS.

wget307test.tgz
Description: GNU Zip compressed data
