Re: autoconf 2.5x and automake support for wget 1.9 beta
Maciej W. Rozycki [EMAIL PROTECTED] writes: I couldn't send the patches earlier, sorry. Besides what you have already done, I have the following bits within my changes. Thanks, I never would have caught those myself. Do you have suggestions for Autoconf 2.5x features Wget could put to good use?
Re: --disable-dns-cache patch
Jeremy Reeve [EMAIL PROTECTED] writes: Please consider this, my trivial --disable-dns-cache patch for wget. ChangeLog should read something like: 2003-09-07 Jeremy S. Reeve [EMAIL PROTECTED] * host.c, init.c, main.c, options.h: Added --disable-dns-cache option to turn off caching of hostname lookups. Thanks for the patch. I'm curious, in what circumstances would one want to use this option? (I'm also asking because of the manual, in which I'd like to explain why the option is useful.) Do you agree with dropping the "disable" from the option name and changing the option to `--dns-cache=[on,off]', with the default being on? That way someone who doesn't ever want caching can put `dns_cache = off' in ~/.wgetrc and still override it with `--dns-cache=on' on the command line. The "disable* = on" style reminds me too much of the old "do you want to delete all your files (yes means no, no means yes) [yes/no]?" prompt. :-)
Re: Retry even when Connection Refused
Ahmon Dancy [EMAIL PROTECTED] writes: I'll apply it shortly. Thanks. Applied now. Is there a wget-announce mailing list? No.
Re: Content-Disposition Take 3
Newman, David [EMAIL PROTECTED] writes: This is my third attempt at a Content-Disposition patch and if it isn't acceptable yet, I'm sure it is pretty close. Thanks. Note that I and other (co-)maintainers have been away for some time, so if your previous attempts have been ignored, it might not have been for lack of quality in your contribution. This patch adds the ability for wget to process the Content-Disposition header. By default wget will ignore the header. However, when used with the --content-disposition option wget will attempt to find a filename stored within the header and use it to store the content. For example, given the URL http://www.maraudingpirates.org/test.php, wget will normally set the local filename to test.php. However, with the --content-disposition option wget will instead process the header Content-Disposition: attachment; filename=joemama.txt and change the local filename to joemama.txt. The thing that worries me about this patch is that in some places Wget's actions depend on transforming the URL into the output file name. I have in mind options like `-c' and `-nc'. Won't your patch break those?
Re: Retry even when Connection Refused
Ahmon Dancy [EMAIL PROTECTED] writes: Is there a wget-announce mailing list? No. Alright. Is there a rough estimate for the next release date? I'm thinking of releasing 1.9 with the accumulated features in the current CVS. The code base is IMHO stable enough for that. The only major issue holding back the release is that configure.in doesn't detect IPv6.
Re: IPv6 detection in configure
Daniel Stenberg [EMAIL PROTECTED] writes: These are two snippets that can be used to detect IPv6 support and a working getaddrinfo(). Adjust as you see fit! Thanks a bunch! I'll try it out later today.
Re: upcoming new wget version
Jochen Roderburg [EMAIL PROTECTED] writes: Question: Is the often discussed *feature* in version 1.8.x meanwhile repaired, that special characters in local filenames are url-encoded? Hmm, that was another thing scheduled to be fixed for 1.9.
Re: Windows filename patch
Herold Heiko [EMAIL PROTECTED] writes: could you please check the thread Windows filename patch for 1.8.2 from around 24-05-2002 (Hack Kampbjørn, Ian Abbott) ? That patch (url.c) got committed to the 1.8 branch but not to the 1.9 branch. Also, it is comprised of two parts, the first one: Part of the reason it wasn't applied was that I wanted to fix the problem properly for 1.9. I guess I could apply your patch now and remove it if/when the proper fix is in place.
Re: rfc2732 patch for wget
Mauro Tortonesi [EMAIL PROTECTED] writes: On Mon, 8 Sep 2003, Post, Mark K wrote: Absolutely. I would much rather get an intelligent error message stating that ipv6 addresses are not supported, versus a misleading one about the host not being found. That would save end-users a whole lot of wasted time. i agree here. OK then. Here is an additional patch: 2003-09-09 Hrvoje Niksic [EMAIL PROTECTED] * url.c (url_parse): Return an error if the URL contains a [...] IPv6 numeric address and we don't support IPv6. Index: src/url.c === RCS file: /pack/anoncvs/wget/src/url.c,v retrieving revision 1.77 diff -u -r1.77 url.c --- src/url.c 2003/09/05 20:36:17 1.77 +++ src/url.c 2003/09/09 13:02:46 @@ -649,7 +649,9 @@ Invalid user name, #define PE_UNTERMINATED_IPV6_ADDRESS 5 Unterminated IPv6 numeric address, -#define PE_INVALID_IPV6_ADDRESS6 +#define PE_IPV6_NOT_SUPPORTED 6 + IPv6 addresses not supported, +#define PE_INVALID_IPV6_ADDRESS7 Invalid IPv6 numeric address }; @@ -658,6 +660,7 @@ *(p) = (v);\ } while (0) +#ifdef INET6 /* The following two functions were adapted from glibc. */ static int @@ -787,8 +790,8 @@ return 1; } +#endif - /* Parse a URL. Return a new struct url if successful, NULL on error. In case of @@ -860,6 +863,7 @@ return NULL; } +#ifdef INET6 /* Check if the IPv6 address is valid. */ if (!is_valid_ipv6_address(host_b, host_e)) { @@ -869,6 +873,10 @@ /* Continue parsing after the closing ']'. */ p = host_e + 1; +#else + SETERR (error, PE_IPV6_NOT_SUPPORTED); + return NULL; +#endif } else {
IPv6 detection in configure
Thanks to Daniel Stenberg who has either been reading my mind or has had the exact same needs, here is a patch that brings configure (auto-)detection for IPv6. Please test it out on various configurations where IPv6 is or is not enabled. ChangeLog: 2003-09-09 Hrvoje Niksic [EMAIL PROTECTED] * configure.in, aclocal.m4: Added configure check for IPv6 and getaddrinfo. From Daniel Stenberg. src/ChangeLog: 2003-09-09 Hrvoje Niksic [EMAIL PROTECTED] * config.h.in: Initialize HAVE_GETADDRINFO and ENABLE_IPV6. * all: Use #ifdef ENABLE_IPV6 instead of the older INET6. Use HAVE_GETADDRINFO for getaddrinfo-related stuff. Index: aclocal.m4 === RCS file: /pack/anoncvs/wget/aclocal.m4,v retrieving revision 1.6 diff -u -r1.6 aclocal.m4 --- aclocal.m4 2003/09/04 21:29:08 1.6 +++ aclocal.m4 2003/09/09 19:25:07 @@ -86,6 +86,47 @@ AC_MSG_RESULT(no) fi]) +dnl +dnl check for working getaddrinfo() +dnl +AC_DEFUN(WGET_CHECK_WORKING_GETADDRINFO,[ + AC_CACHE_CHECK(for working getaddrinfo, ac_cv_working_getaddrinfo,[ + AC_TRY_RUN( [ +#include netdb.h +#include sys/types.h +#include sys/socket.h + +int main(void) { +struct addrinfo hints, *ai; +int error; + +memset(hints, 0, sizeof(hints)); +hints.ai_family = AF_UNSPEC; +hints.ai_socktype = SOCK_STREAM; +error = getaddrinfo(127.0.0.1, 8080, hints, ai); +if (error) { +exit(1); +} +else { +exit(0); +} +} +],[ + ac_cv_working_getaddrinfo=yes +],[ + ac_cv_working_getaddrinfo=no +],[ + ac_cv_working_getaddrinfo=yes +])]) +if test x$ac_cv_working_getaddrinfo = xyes; then + AC_DEFINE(HAVE_GETADDRINFO, 1, [Define if getaddrinfo exists and works]) + AC_DEFINE(ENABLE_IPV6, 1, [Define if you want to enable IPv6 support]) + + IPV6_ENABLED=1 + AC_SUBST(IPV6_ENABLED) +fi +]) + # This code originates from Ulrich Drepper's AM_WITH_NLS. Index: configure.in === RCS file: /pack/anoncvs/wget/configure.in,v retrieving revision 1.36 diff -u -r1.36 configure.in --- configure.in2003/09/05 19:33:44 1.36 +++ configure.in2003/09/09 19:25:09 @@ -30,7 +30,7 @@ dnl AC_INIT(src/version.c) -AC_PREREQ(2.12) +AC_PREREQ(2.50) AC_CONFIG_HEADER(src/config.h) dnl @@ -155,7 +155,6 @@ AC_C_INLINE AC_TYPE_SIZE_T AC_TYPE_PID_T -dnl This generates a warning. What do I do to shut it up? AC_C_BIGENDIAN # Check size of long. @@ -441,6 +440,55 @@ fi AC_DEFINE(HAVE_MD5) AC_SUBST(MD5_OBJ) + +dnl ** +dnl Checks for IPv6 +dnl ** + +dnl +dnl If --enable-ipv6 is specified, we try to use IPv6 (as long as +dnl getaddrinfo is also present). If --disable-ipv6 is specified, we +dnl don't use IPv6 or getaddrinfo. If neither are specified, we test +dnl whether it's possible to create an AF_INET6 socket and if yes, use +dnl IPv6. +dnl + +AC_MSG_CHECKING([whether to enable ipv6]) +AC_ARG_ENABLE(ipv6, +AC_HELP_STRING([--enable-ipv6],[Enable ipv6 support]) +AC_HELP_STRING([--disable-ipv6],[Disable ipv6 support]), +[ case $enableval in + no) + AC_MSG_RESULT(no) + ipv6=no + ;; + *) AC_MSG_RESULT(yes) + ipv6=yes + ;; + esac ], + + AC_TRY_RUN([ /* is AF_INET6 available? */ +#include sys/types.h +#include sys/socket.h +main() +{ + if (socket(AF_INET6, SOCK_STREAM, 0) 0) + exit(1); + else + exit(0); +} +], + AC_MSG_RESULT(yes) + ipv6=yes, + AC_MSG_RESULT(no) + ipv6=no, + AC_MSG_RESULT(no) + ipv6=no +)) + +if test x$ipv6 = xyes; then + WGET_CHECK_WORKING_GETADDRINFO +fi dnl dnl Set of available languages. 
Index: src/config.h.in === RCS file: /pack/anoncvs/wget/src/config.h.in,v retrieving revision 1.24 diff -u -r1.24 config.h.in --- src/config.h.in 2002/05/18 02:16:19 1.24 +++ src/config.h.in 2003/09/09 19:25:32 @@ -250,6 +250,12 @@ /* Define if we're using builtin (GNU) md5.c. */ #undef HAVE_BUILTIN_MD5 +/* Define if you have the getaddrinfo function. */ +#undef HAVE_GETADDRINFO + +/* Define if you want to enable the IPv6 support. */ +#undef ENABLE_IPV6 + /* First a gambit to see whether we're on Solaris. We'll need it below. */ #ifdef __sun Index: src/connect.c === RCS file: /pack/anoncvs/wget/src/connect.c,v retrieving revision 1.18 diff -u -r1.18 connect.c --- src/connect.c 2002/05/18 02:16:19 1.18 +++ src/connect.c 2003/09/09 19:25:33 @@ -412,7 +412,7 @@ switch (mysrv.sa.sa_family) { -#ifdef INET6 +#ifdef ENABLE_IPV6 case AF_INET6: memcpy (ip, mysrv.sin6.sin6_addr, 16); return 1; Index: src/ftp-basic.c
Re: --disable-dns-cache patch
Mauro Tortonesi [EMAIL PROTECTED] writes: Thanks for the patch. I'm curious, in what circumstances would one want to use this option? (I'm also asking because of the manual in which I'd like to explain why the option is useful.) e.g., with RFC 3041 temporary ipv6 addresses. Do they really change within a Wget run? Remember that Wget's cache is not written anywhere on disk.
Re: autoconf 2.5 patch for wget
[ I'm Cc-ing the list because this might be interesting to others. ] Mauro Tortonesi [EMAIL PROTECTED] writes: ok, i agree here. but, in order to help me with my work on wget, could you please tell me: * how do you generate a wget tarball for a new release With the script `dist-wget' in the util directory. Ideally the `make dist' target should do the same job, but it gets some things wrong. Take a look at what `dist-wget' does, AFAIR it's pretty clearly written. * how do you generate/maintain gettext-related files (e.g. the files in the po directory The `.po' files are from the translation project. POTFILES.IN is generated by hand when a new `.c' file is added. * how do you generate/maintain libtool-related files (e.g. ltmain.sh) When a new libtool release comes out, ltmain.sh is replaced with the new one and aclocal.m4 is updated with the latest libtool.m4. config.sub and config.guess are updated as needed. * how do you generate/maintain automake-related files (e.g. aclocal.m4, mkinstalldirs, install-sh, etc...) I don't use Automake. mkinstalldirs and install-sh are standard Autoconf stuff that probably hasn't changed for years. If a bug is discovered, you can get the latest version from the latest Autoconf or wherever. it would be impossible for me to keep working on the autoconf-related part of wget without these info. I hope the above helped. There's really not much into it. BTW: could you please tell me what of these changes are acceptable for you: * Re-organized all wget-specific autoconf macros in the config directory As long as you're very careful not to break things, I'm fine with that. But be careful: take into account that Wget doesn't ship with libintl, that it doesn't use Automake, etc. When in doubt, ask. If possible, start with small things. * Re-libtoolized and re-gettextized the package I believe that libtoolization and gettextization are tied with Automake, but I could be wrong. I'm pretty sure that the gettextization process was wrong for Wget. * Updated aclocal.m4, config.guess, config.sub Note that Wget doesn't use a pre-generated (or auto-generated) aclocal.m4. Updating config.guess and config.sub is, of course, fine. * Added IPv6 stack detection to the configuration process Please be careful: Wget doesn't need the kind of stack detection that I've seen in many programs patched to support IPv6. Specifically, I don't want to cater to old buggy or obsolete IPv6 stacks. That's what I liked about Daniel's patch: it was straightforward and seemed to do the trick. If at all possible, go along those lines. * Re-named configure.in to configure.ac and modified the file for better autoconf 2.5x compliance That's fine, as long as it's uncoupled from other changes. Specifically, it should be possible to test all Autoconf-related changes. * Added profiling support to the configure script I'm not sure what you mean here. Why does configure need to be aware of profilers? * Re-named the realclean target to maintainer-clean in the Makefiles for better integration with po/Makefile.in.in and conformance to the de-facto standards That should be fine. * Modified the invocation of config.status in the targets in the Dependencies for maintenance section of Makefile.in, according to the new syntax introduced by autoconf 2.5x I haven't studied the new Autoconf in detail, but I trust that you know what you're doing here. 
util/Makefile.in: added rmold.pl target, just like texi2pod.pl in doc/Makefile.in src/wget.h: added better handling of HAVE_ALLOCA_H and changed USE_NLS to ENABLE_NLS Sounds fine. BTW what do you mean by better handling of HAVE_ALLOCA_H? Do you actually know that Wget's code was broken on some platforms, or are you just replacing old Autoconf boilerplate code with new one? Thanks for the work you've put in.
Re: autoconf 2.5 patch for wget
Mauro Tortonesi [EMAIL PROTECTED] writes: * how do you generate/maintain gettext-related files (e.g. the files in the po directory The `.po' files are from the translation project. POTFILES.IN is generated by hand when a new `.c' file is added. ok, but what about Makefile.in.in and wget.pot? AFAIR wget.pot is generated by Makefile. (It should probably not be in CVS, though.) Makefile.in.in is not generated, it was originally adapted from the original Makefile.in.in from the gettext distribution. It has served well for years in the current form. * how do you generate/maintain libtool-related files (e.g. ltmain.sh) When a new libtool release comes out, ltmain.sh is replaced with the new one and aclocal.m4 is updated with the latest libtool.m4. config.sub and config.guess are updated as needed. do you mean that you simply copy these files manually from other packages? Yes. I don't do that very often. how do you update aclocal.m4? Wget's aclocal.m4 only contains Wget-specific stuff so it doesn't need special updating. The single exception is, of course, the `libtool.m4' part which needs to be updated along with ltmain.sh, but that is also rare. I really think aclocal.m4 should simply be INCLUDEing libtool.m4, but I wasn't sure how to do that, so I left it at that. (Note that I wasn't the one who introduced libtool to Wget, so it wasn't up to me originally.) please, notice that i am __NOT__ criticizing this. Don't worry, I'm not reading malice in your questions. All your questions are in fact quite valid and responding to them serves to remind myself of why I made the choices I did. I don't use Automake. mkinstalldirs and install-sh are standard Autoconf stuff true. that probably hasn't changed for years. i am not so sure about this. If they've changed and if updating them won't break anything, feel free to update them. (In a separate patch if possible.:-)). * Updated aclocal.m4, config.guess, config.sub Note that Wget doesn't use a pre-generated (or auto-generated) aclocal.m4. Updating config.guess and config.sub is, of course, fine. how do you maintain aclocal.m4, then? by hand? this seems a bit too manual for me :-) I believe Wget's aclocal.m4 is quite different from the ones in Automake-influenced software. I could be wrong, though. Please take another look at it, and please do ignore the libtool stuff which should really be handled with an include. and, more important, with this approach your package may keep using broken/obsoleted autoconf macros without your knowledge. I'm not so sure about that. The way I see it, Wget's configure.in and aclocal.m4 use documented Autoconf macros. Unless Autoconf changes incompatibly (which they shouldn't do without changing the major version), they should keep working. * Added IPv6 stack detection to the configuration process Please be careful: Wget doesn't need the kind of stack detection that I've seen in many programs patched to support IPv6. i am afraid you're wrong here. usagi or kame stack detection is necessary to link the binary to libinet6 (if present). this lets wget use a version of getaddrinfo which is RFC3493-compliant and supports the AI_ALL, AI_ADDRCONFIG (which is __VERY__ important) and AI_V4MAPPED flags. the implementation of getaddrinfo shipped with glibc is not RFC3493-compliant. Shouldn't we simply check for libinet6 in the usual fashion? Furthermore, I don't think that Wget uses any of those flags. Why are should an application that doesn't use them care? 
Note that I ask this not to annoy you but to learn; you obviously know much more about IPv6 than I do. I have to go now; I'll answer the rest of your message separately. Thanks for your patience and for the detailed reply.
Re: autoconf 2.5 patch for wget
Mauro Tortonesi [EMAIL PROTECTED] writes: AFAIR wget.pot is generated by Makefile. (It should probably not be in CVS, though.) Makefile.in.in is not generated, it was originally adapted from the original Makefile.in.in from the gettext distribution. It has served well for years in the current form. ok. i'll see if the new Makefile.in.in which ships with the latest gettext is worth an upgrade. Note that Wget's Makefile.in.in is likely quite different than the canonical version because of the lack of libintl bundling. That's as it should be. how do you update aclocal.m4? Wget's aclocal.m4 only contains Wget-specific stuff so it doesn't need special updating. The single exception is, of course, the `libtool.m4' part which needs to be updated along with ltmain.sh, but that is also rare. I really think aclocal.m4 should simply be INCLUDEing libtool.m4, but I wasn't sure how to do that, so I left it at that. (Note that I wasn't the one who introduced libtool to Wget, so it wasn't up to me originally.) ok, so you simply take libtool.m4 or maybe only a part of it, and add all wget-specific macros to it. Or the other way around: leave Wget-specific macros and replace libtool.m4 contents. aclocal.m4 has this part: # We embed libtool.m4 from libtool distribution. # -- embedded libtool.m4 begins here -- [ ... contents of libtool.m4 follows ... ] # -- embedded libtool.m4 ends here -- When you need to update libtool.m4, you do the obvious -- replace the old contents of libtool.m4 with the new contents. As I said, it would be even better if it said something like AC_INCLUDE([libtool.m4]) (or whatever the correct syntax is), so you can simply drop in the new libtool.m4 without the need for editing. Shouldn't we simply check for libinet6 in the usual fashion? this could be another solution. but i think it would be much better to do it only for kame and usagi stack. Hmm. Checking for stacks by names is not the Autoconf way. Isn't it better to test for needed features? Daniel's test was written in that spirit. Furthermore, I don't think that Wget uses any of those flags. Why are should an application that doesn't use them care? Note that I ask this not to annoy you but to learn; you obviously know much more about IPv6 than I do. well, it is very important using AI_ADDRCONFIG with getaddrinfo. in this way you get resolution of records only if you have ipv6 working on your host (and, less important, resolution of A records only if you have ipv4 working on your host). dns resolution in a mixed ipv4 and ipv6 environment is a nightmare and AI_ADDRCONFIG can save you a lot of headaches. Very interesting. So what you're saying is that programs that simply follow the getaddrinfo man page (including IPv6-enabled Wget in Debian) don't work in mixed environments? That's really strange.
Re: using host-cache configurable via command line
Patrick Cernko [EMAIL PROTECTED] writes: I discovered a small problem with the increasing number of servers with changing IPs but a constant name (provided by nameservers like dyndns.org). If the download with wget is interrupted by an IP change (e.g. a dialup host whose provider killed the connection), wget retries the download using the previously cached IP. This will fail, as the host (specified by its dyndns hostname) is no longer reachable via this old IP. Instead it is reachable over a new IP (assigned by its provider). But it is still reachable via its hostname, as the host updated the DNS entry with its new IP. So I patched wget to tell it not to use the cached IPs from earlier, but instead to do a new host lookup as if it were connecting to the host for the first time. Patrick, thanks for the patch and the explanation. A similar change, probably with invocation `--dns-cache=off', is scheduled to appear in the next release. Your contribution is also important because we've been looking for a suitable text for the manual that explains why it is sometimes beneficial to turn off the DNS cache.
Re: autoconf 2.5 patch for wget
Mauro Tortonesi [EMAIL PROTECTED] writes: On Wed, 10 Sep 2003, Hrvoje Niksic wrote: Mauro Tortonesi [EMAIL PROTECTED] writes: Shouldn't we simply check for libinet6 in the usual fashion? this could be another solution. but i think it would be much better to do it only for kame and usagi stack. Hmm. Checking for stacks by names is not the Autoconf way. Isn't it better to test for needed features? Daniel's test was written in that spirit. i think kame or usagi stack detection is not so ugly and works better than the simple detection of libinet6. in fact, if you don't want to perform stack detection, you have to test if libinet6 is installed on the host system __and__ if the getaddrinfo function contained in libinet6 is better than the one shipped with the libc. it is a cleaner (and better) approach, but much more complicated and error-prone than stack detection, IMVHO. Isn't the second check a matter of running a small test program, as in the check that Daniel provided (but more sophisticated)? If we absolutely must detect kame and usagi (whatever those are :), we'll do so. But I'd like to be sure that other options have been researched. Furthermore, I don't think that Wget uses any of those flags. Why should an application that doesn't use them care? Note that I ask this not to annoy you but to learn; you obviously know much more about IPv6 than I do. well, it is very important to use AI_ADDRCONFIG with getaddrinfo. in this way you get resolution of AAAA records only if you have ipv6 working on your host (and, less important, resolution of A records only if you have ipv4 working on your host). dns resolution in a mixed ipv4 and ipv6 environment is a nightmare and AI_ADDRCONFIG can save you a lot of headaches. Very interesting. So what you're saying is that programs that simply follow the getaddrinfo man page (including IPv6-enabled Wget in Debian) don't work in mixed environments? That's really strange. no, i'm not saying that. i'm saying that if you have a program that calls getaddrinfo on an ipv6(4)-only host you also get A (AAAA) records with ipv4(6) addresses that you cannot connect to. this may slow down the connection process (if the code is well written), or simply break it (if the code is badly written) and may also cause other subtle problems. Then it sounds like we definitely want to use this flag. However, you go on to say: by using the AI_ADDRCONFIG flag with getaddrinfo on an ipv6(4)-only host you get only AAAA (A) records. however, i think that a "use ipvX only" configuration option is a better solution than AI_ADDRCONFIG. Better solution in the sense that we shouldn't use AI_ADDRCONFIG after all? Or that this configuration option should be an alternative to AI_ADDRCONFIG? If the latter is the case, should there be a "use ipvX only" runtime option as well?
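For readers following along, here is a minimal sketch of the getaddrinfo() usage being discussed. AI_ADDRCONFIG is the standard flag named above; the function below is illustrative only and is not part of any patch in this thread.

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netdb.h>
  #include <string.h>

  /* Resolve HOST:PORT for either address family.  With AI_ADDRCONFIG
     the resolver returns AAAA records only when the host actually has
     IPv6 configured, and A records only when it has IPv4.  */
  static struct addrinfo *
  lookup_host (const char *host, const char *port)
  {
    struct addrinfo hints, *res;

    memset (&hints, 0, sizeof (hints));
    hints.ai_family = AF_UNSPEC;        /* accept both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;
  #ifdef AI_ADDRCONFIG
    hints.ai_flags = AI_ADDRCONFIG;     /* skip families not configured locally */
  #endif
    if (getaddrinfo (host, port, &hints, &res) != 0)
      return NULL;
    return res;                         /* caller frees with freeaddrinfo() */
  }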
Re: autoconf 2.5 patch for wget
Mauro Tortonesi [EMAIL PROTECTED] writes: Isn't the second check a matter of running a small test program, as in the check that Daniel provided (but more sophisticated)? sure. but what was the problem with stack detection? it's simply a couple of AC_EGREP_CPP macros after all... The problem I have with IN6_GUESS_STACK is that it seems to rely on product information, in this case the known stack names. And those things change. So when usagi gets renamed to yojimbo or when we port Wget to a new IPv6-aware architecture, or when a new IPv6 implementation gets added to an existing architecture, we need to update our Autoconf macros. Updating the macros sucks, not only because M4 blows chunks, but also because it means that older source releases of Wget will no longer work. One of the design goals of Autoconf was to avoid the fallacy of older tools that had complex product databases that had to be maintained by hand. Instead, most Autoconf tests try to check for features. The exception are cases when such checks are not possible or feasible. This might or might not be the case here. So if it really takes too long or it's just too hard to write a check, then we'll use a version of IN6_GUESS_STACK. i could start from: http://cvs.deepspace6.net/view/nc6/config/in6_guess_stack.m4?rev=HEADcontent-type=text/vnd.viewcvs-markup and made it much simpler (15-30 lines). what is your opinion about it? Simplifying that code, *and* adding a fallback that handles unknown stacks in a reasonable fashion (for example by assuming minimal functionality or strict standard compliance) sounds fine to me. I'd still prefer a purely feature based check, but again, if you tell me it's hard or impossible to write one, I'll believe you. If the latter is the case, should be a use ipvX only runtime option as well? i think that -4 and -6 command line options for wget are a MUST. the first would make wget use ipv4 only, while the second would make wget use ipv6 only. believe me, there are plenty of cases in which you want to use such options. I agree that those options are useful. And since Wget doesn't currently use numeric-only options, those are available.
Re: bug in wget - wget break on time msec=0
Boehn, Gunnar von [EMAIL PROTECTED] writes: I think I found a bug in wget. You did. But I believe your subject line is slightly incorrect. Wget handles 0 length time intervals (see the assert message), but what it doesn't handle are negative amounts. And indeed: gettimeofday({1063461157, 858103}, NULL) = 0 gettimeofday({1063461157, 858783}, NULL) = 0 gettimeofday({1063461157, 880833}, NULL) = 0 gettimeofday({1063461157, 874729}, NULL) = 0 As you can see, the last gettimeofday returned time *preceding* the one before it. Your ntp daemon must have chosen that precise moment to set back the system clock by ~6 milliseconds, to which Wget reacted badly. Even so, Wget shouldn't crash. The correct fix is to disallow the timer code from ever returning decreasing or negative time intervals. Please let me know if this patch fixes the problem: 2003-09-14 Hrvoje Niksic [EMAIL PROTECTED] * utils.c (wtimer_sys_set): Extracted the code that sets the current time here. (wtimer_reset): Call it. (wtimer_sys_diff): Extracted the code that calculates the difference between two system times here. (wtimer_elapsed): Call it. (wtimer_elapsed): Don't return a value smaller than the previous one, which could previously happen when system time is set back. Instead, reset start time to current time and note the elapsed offset for future calculations. The returned times are now guaranteed to be monotonically nondecreasing. Index: src/utils.c === RCS file: /pack/anoncvs/wget/src/utils.c,v retrieving revision 1.51 diff -u -r1.51 utils.c --- src/utils.c 2002/05/18 02:16:25 1.51 +++ src/utils.c 2003/09/13 23:09:13 @@ -1532,19 +1532,30 @@ # endif #endif /* not WINDOWS */ -struct wget_timer { #ifdef TIMER_GETTIMEOFDAY - long secs; - long usecs; +typedef struct timeval wget_sys_time; #endif #ifdef TIMER_TIME - time_t secs; +typedef time_t wget_sys_time; #endif #ifdef TIMER_WINDOWS - ULARGE_INTEGER wintime; +typedef ULARGE_INTEGER wget_sys_time; #endif + +struct wget_timer { + /* The starting point in time which, subtracted from the current + time, yields elapsed time. */ + wget_sys_time start; + + /* The most recent elapsed time, calculated by wtimer_elapsed(). + Measured in milliseconds. */ + long elapsed_last; + + /* Approximately, the time elapsed between the true start of the + measurement and the time represented by START. */ + long elapsed_pre_start; }; /* Allocate a timer. It is not legal to do anything with a freshly @@ -1577,22 +1588,17 @@ xfree (wt); } -/* Reset timer WT. This establishes the starting point from which - wtimer_elapsed() will return the number of elapsed - milliseconds. It is allowed to reset a previously used timer. */ +/* Store system time to WST. */ -void -wtimer_reset (struct wget_timer *wt) +static void +wtimer_sys_set (wget_sys_time *wst) { #ifdef TIMER_GETTIMEOFDAY - struct timeval t; - gettimeofday (t, NULL); - wt-secs = t.tv_sec; - wt-usecs = t.tv_usec; + gettimeofday (wst, NULL); #endif #ifdef TIMER_TIME - wt-secs = time (NULL); + time (wst); #endif #ifdef TIMER_WINDOWS @@ -1600,39 +1606,76 @@ SYSTEMTIME st; GetSystemTime (st); SystemTimeToFileTime (st, ft); - wt-wintime.HighPart = ft.dwHighDateTime; - wt-wintime.LowPart = ft.dwLowDateTime; + wst-HighPart = ft.dwHighDateTime; + wst-LowPart = ft.dwLowDateTime; #endif } -/* Return the number of milliseconds elapsed since the timer was last - reset. It is allowed to call this function more than once to get - increasingly higher elapsed values. */ +/* Reset timer WT. 
This establishes the starting point from which + wtimer_elapsed() will return the number of elapsed + milliseconds. It is allowed to reset a previously used timer. */ -long -wtimer_elapsed (struct wget_timer *wt) +void +wtimer_reset (struct wget_timer *wt) { + /* Set the start time to the current time. */ + wtimer_sys_set (wt-start); + wt-elapsed_last = 0; + wt-elapsed_pre_start = 0; +} + +static long +wtimer_sys_diff (wget_sys_time *wst1, wget_sys_time *wst2) +{ #ifdef TIMER_GETTIMEOFDAY - struct timeval t; - gettimeofday (t, NULL); - return (t.tv_sec - wt-secs) * 1000 + (t.tv_usec - wt-usecs) / 1000; + return ((wst1-tv_sec - wst2-tv_sec) * 1000 + + (wst1-tv_usec - wst2-tv_usec) / 1000); #endif #ifdef TIMER_TIME - time_t now = time (NULL); - return 1000 * (now - wt-secs); + return 1000 * (*wst1 - *wst2); #endif #ifdef WINDOWS - FILETIME ft; - SYSTEMTIME st; - ULARGE_INTEGER uli; - GetSystemTime (st); - SystemTimeToFileTime (st, ft); - uli.HighPart = ft.dwHighDateTime; - uli.LowPart = ft.dwLowDateTime; - return (long)((uli.QuadPart - wt-wintime.QuadPart) / 1); + return (long)(wst1-QuadPart - wst2-QuadPart) / 1; #endif +} + +/* Return the number of milliseconds
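The rest of the patch is truncated above; judging from the ChangeLog entry, the monotonic wtimer_elapsed() it describes amounts to roughly the following. This is a sketch using the helpers introduced earlier in the patch, not the verbatim remainder of the diff.

  long
  wtimer_elapsed (struct wget_timer *wt)
  {
    wget_sys_time now;
    long elapsed;

    wtimer_sys_set (&now);
    elapsed = wt->elapsed_pre_start + wtimer_sys_diff (&now, &wt->start);

    if (elapsed < wt->elapsed_last)
      {
        /* The system clock was set back; re-anchor the timer at the
           current time and carry over the last reading, so the values
           we return stay monotonically nondecreasing.  */
        wt->start = now;
        wt->elapsed_pre_start = wt->elapsed_last;
        elapsed = wt->elapsed_last;
      }

    wt->elapsed_last = elapsed;
    return elapsed;
  }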
More flexible URL file name generation
This patch makes URL file name generation a bit more flexible and, hopefully, better for the end-user. It does two things: * Decouples file name quoting from URL quoting. The conflation of the two has been an endless source of annoyance for users. For example, space *has* to be quoted in URLs, but you don't really want to quote it in file names. * Gives the user more control over the quoting mechanism. There are now several quoting levels: --restrict-file-names=none - no restriction, only quote / and \0 --restrict-file-names=unix - quote the above, plus chars in the 0-31 and in the 128-159 range, which are not printable in the shell. --restrict-file-names=windows - quote the above, plus chars disallowed on Windows: \, |, , , ?, :, *, and . The default windows under Windows and Cygwin and unix elsewhere. This patch should supersede the various patches that have been floating around that fix the problem in a limited fashion. Please test this patch and let me know if it works for you, and if something else is needed. 2003-09-14 Hrvoje Niksic [EMAIL PROTECTED] * url.c (append_uri_pathel): Use opt.restrict_file_names when calling file_unsafe_char. * init.c: New command restrict_file_names. * main.c (main): New option --restrict-file-names[=windows,unix]. * url.c (url_file_name): Renamed from url_filename. (url_file_name): Add directory and hostdir prefix here, not in mkstruct. (append_dir_structure): New function, does part of the work that used to be in mkstruct. Iterates over path elements in u-path, calling append_uri_pathel on each one to append it to the file name. (append_uri_pathel): URL-unescape a path element and reencode it with a different set of rules, more appropriate for handling of files. (file_unsafe_char): New function, uses a lookup table to decide whether a character should be escaped for use in file name. (append_string): New utility function. (append_char): Ditto. (file_unsafe_char): New argument restrict_for_windows, decide whether Windows file names should be escaped in run-time. * connect.c: Include stdlib.h to get prototype for abort(). Index: NEWS === RCS file: /pack/anoncvs/wget/NEWS,v retrieving revision 1.38 diff -u -r1.38 NEWS --- NEWS2003/09/10 20:21:13 1.38 +++ NEWS2003/09/14 21:45:48 @@ -7,8 +7,6 @@ * Changes in Wget 1.9. -** The build process now requires Autoconf 2.5x. - ** It is now possible to specify that POST method be used for HTTP requests. For example, `wget --post-data=id=foodata=bar URL' will send a POST request with the specified contents. @@ -32,6 +30,15 @@ ** The new option `--dns-cache=off' may be used to prevent Wget from caching DNS lookups. + +** The build process now requires Autoconf 2.5x. + +** Wget no longer quotes characters in local file names that would be +considered unsafe as part of URL. Quoting can still occur for +control characters or for '/', but no longer for frequent characters +such as space. You can use the new option --restrict-file-names to +enforce even stricter rules, which is useful when downloading to +Windows partitions. * Wget 1.8.1 is a bugfix release with no user-visible changes. Index: doc/wget.texi === RCS file: /pack/anoncvs/wget/doc/wget.texi,v retrieving revision 1.68 diff -u -r1.68 wget.texi --- doc/wget.texi 2003/09/10 19:41:50 1.68 +++ doc/wget.texi 2003/09/14 21:46:10 @@ -800,6 +800,39 @@ If you don't understand the above description, you probably won't need this option. 
+ [EMAIL PROTECTED] file names, restrict [EMAIL PROTECTED] Windows file names [EMAIL PROTECTED] --restrict-file-names=none|unix|windows +Restrict characters that may occur in local file names created by Wget +from remote URLs. Characters that are considered @dfn{unsafe} under a +set of restrictions are escaped, i.e. replaced with @samp{%XX}, where [EMAIL PROTECTED] is the hexadecimal code of the character. + +The default for this option depends on the operating system: on Unix and +Unix-like OS'es, it defaults to ``unix''. Under Windows and Cygwin, it +defaults to ``windows''. Changing the default is useful when you are +using a non-native partition, e.g. when downloading files to a Windows +partition mounted from Linux, or when using NFS-mounted or SMB-mounted +Windows drives. + +When set to ``none'', the only characters that are quoted are those that +are impossible to get into a file name---the NUL character and @samp{/}. +The control characters, newline, etc. are all placed into file names. + +When set to ``unix
Re: wget proxy support
Nicolas, thanks for the patch; I'm about to apply it to Wget CVS.
Re: upcoming new wget version
Hrvoje Niksic [EMAIL PROTECTED] writes: Jochen Roderburg [EMAIL PROTECTED] writes: Question: Is the often discussed *feature* in version 1.8.x meanwhile repaired, that special characters in local filenames are url-encoded? Hmm, that was another thing scheduled to be fixed for 1.9. I believe that the feature has now been fixed. Please try the latest CVS and let me know what you think. BTW URL-escaping special chars in file names is not specific to 1.8.x. All Wget versions until 1.9 have suffered to some extent from file quoting being coupled with URL quoting. It became worse in 1.8.x because it implemented stricter [and more correct] URL escaping rules -- which happened to be even less appropriate for file names.
Re: small doc update patch
Noèl Köthe [EMAIL PROTECTED] writes: On Wed, 2003-09-10 at 22:21, Hrvoje Niksic wrote: Just a small patch for the documentation: --- wget-1.8.2.orig/doc/wget.texi +++ wget-1.8.2/doc/wget.texi @@ -507,7 +507,7 @@ @item -t @var{number} @itemx [EMAIL PROTECTED] Set number of retries to @var{number}. Specify 0 or @samp{inf} for -infinite retrying. +infinite retrying. Default (no command-line switch) is not to retry. Huh? The default is to retry 20 times. Isn't it? :-) Hmm, then I got it wrong: $ LC_ALL=C wget -t 0 http://localhost/asdf --00:22:01-- http://localhost/asdf => `asdf' Resolving localhost... done. Connecting to localhost[127.0.0.1]:80... failed: Connection refused. It doesn't retry fatal errors such as connection refused, and --tries doesn't change that. The flag --retry-connrefused, new in CVS, tells Wget to treat connection refused as a non-fatal error.
Re: windows compile error
Herold Heiko [EMAIL PROTECTED] writes: Just a quick note, the current cvs code on windows during compile (with VC++6) stops with

  cl /I. /DWINDOWS /D_CONSOLE /DHAVE_CONFIG_H /DSYSTEM_WGETRC=\"wgetrc\" /DHAVE_SSL /nologo /MT /W0 /O2 /c utils.c
  utils.c
  utils.c(1651) : error C2520: conversion from unsigned __int64 to double not implemented, use signed __int64

The culprit seems to be (in wtimer_sys_diff)

  #ifdef WINDOWS
    return (double)(wst1->QuadPart - wst2->QuadPart) / 10000;
  #endif

Does this patch help?

2003-09-16  Hrvoje Niksic  [EMAIL PROTECTED]

        * utils.c (wtimer_sys_diff): Convert the time difference to
        signed __int64, then to double.  This works around MS VC++ 6
        which can't convert unsigned __int64 to double directly.

Index: src/utils.c
===
RCS file: /pack/anoncvs/wget/src/utils.c,v
retrieving revision 1.54
diff -u -r1.54 utils.c
--- src/utils.c 2003/09/15 21:14:15 1.54
+++ src/utils.c 2003/09/16 21:01:02
@@ -1648,7 +1648,10 @@
 #endif

 #ifdef WINDOWS
-  return (double)(wst1->QuadPart - wst2->QuadPart) / 10000;
+  /* VC++ 6 doesn't support direct cast of uint64 to double.  To work
+     around this, we subtract, then convert to signed, then finally to
+     double.  */
+  return (double)(signed __int64)(wst1->QuadPart - wst2->QuadPart) / 10000;
 #endif
 }
Re: bug in wget 1.8.1/1.8.2
Dieter Drossmann [EMAIL PROTECTED] writes: I use a extra file with a long list of http entries. I included this file with the -i option. After 154 downloads I got an error message: Segmentation fault. With wget 1.7.1 everything works well. Is there a new limit of lines? No, there's no built-in line limit, what you're seeing is a bug. I cannot see anything wrong inspecting the code, so you'll have to help by providing a gdb backtrace. You can get it by doing this: * Compile Wget with `-g' by running `make CFLAGS=-g' in its source directory (after configure, of course.) * Go to the src/ directory and run that version of Wget the same way you normally run it, e.g. ./wget -i FILE. * When Wget crashes, run `gdb wget core', type `bt' and mail us the resulting stack trace. Thanks for the report.
Re: Incomplete man page on wget
Mitra [EMAIL PROTECTED] writes: Hi, Thanks for the response. I've never used Info before, except for documentation of emacs and very few things are documented there. I suggest it should be presumed that people will look at man wget or wget --help and make sure the documentation is either the same, or that there is a level of indirection to info wget You are right. The current man page does not seem to mention that it is only an excerpt from the entire documentation, and that is a bug. As for Info, note that Wget is a GNU program, and Info is the preferred documentation format of the GNU project.
Re: windows compile error
Herold Heiko [EMAIL PROTECTED] writes: Does compile now, but I managed to produce an application error during a test run on an https site. I produced a debug build with /DDEBUG /Zi /Od /Fd /FR and produced the wget.bsc by running bscmake on all the sbr files, but I didn't yet understand how to use that one in VC++ in order to get a meaningful stack trace and so on. The only thing I got for now is :SSLEAY32! 0023ca38() as the breaking point. It sounds like an https thing. Is the error repeatable? If so, can you repeat it with an earlier CVS snapshot?
Re: Small change to print SSL version
Christopher G. Lewis [EMAIL PROTECTED] writes: Here's a small change to print out the OpenSSL version with the -V / --help parameters. [...] I think that the "GNU Wget <something>" string should always stand for Wget's version, regardless of the libraries it has been compiled with. But if you want to see the version of libraries, why not make it clearer, e.g.: GNU Wget x.x.x (compiled with OpenSSL x.x.x) BTW can't you find out the OpenSSL version by using `ldd'?
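For what it's worth, a sketch of how such a banner could be printed, assuming OpenSSL's opensslv.h is available; this is an illustration, not the submitted patch.

  #include <stdio.h>
  #ifdef HAVE_SSL
  # include <openssl/opensslv.h>   /* provides OPENSSL_VERSION_TEXT */
  #endif

  static void
  print_version_banner (const char *wget_version)
  {
  #ifdef HAVE_SSL
    /* e.g. "GNU Wget 1.9-beta (compiled with OpenSSL 0.9.7 ...)" */
    printf ("GNU Wget %s (compiled with %s)\n",
            wget_version, OPENSSL_VERSION_TEXT);
  #else
    printf ("GNU Wget %s\n", wget_version);
  #endif
  }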
Re: Handling of Content-Length 0
Stefan Eissing [EMAIL PROTECTED] writes: Of course this is only noticeable with HTTP/1.1 servers which leave the connection open and do not apply transfer-encoding: chunked for empty response bodies. They may not apply chunked transfer because Wget doesn't know how to handle it. And leaving the connections open is also Wget's bug because it explicitly requests it. I imagine this should be quite easy to fix... Yes. Patch following RSN.
Re: small doc update patch
Noèl Köthe [EMAIL PROTECTED] writes: -infinite retrying. +infinite retrying. Default (no command-line switch) is to retry +20 times but fatal errors like connection refused or not found +(404) are not being retried. Thanks. I've now committed this: Index: doc/wget.texi === RCS file: /pack/anoncvs/wget/doc/wget.texi,v retrieving revision 1.75 retrieving revision 1.77 diff -u -r1.75 -r1.77 --- doc/wget.texi 2003/09/17 01:32:02 1.75 +++ doc/wget.texi 2003/09/17 21:00:03 1.77 @@ -512,7 +512,9 @@ @item -t @var{number} @itemx [EMAIL PROTECTED] Set number of retries to @var{number}. Specify 0 or @samp{inf} for -infinite retrying. +infinite retrying. The default is to retry 20 times, with the exception +of fatal errors like ``connection refused'' or ``not found'' (404), +which are not retried. @item -O @var{file} @itemx [EMAIL PROTECTED]
Re: windows compile error
Herold Heiko [EMAIL PROTECTED] writes: Repeatable, and it seems to appear with this: 2003-09-15 Hrvoje Niksic [EMAIL PROTECTED] * retr.c (get_contents): Reduce the buffer size to the amount of data that may pass through for one second. This prevents long sleeps when limiting bandwidth. * connect.c (connect_to_one): Reduce the socket's RCVBUF when bandwidth limitation to small values is requested. Previous checkout (checkout -D 23:30 15 sep 2003) wget works fine. I also found a public site which seems to expose the problem (at least from my machine): wget -dv https://www.shavlik.com/pHome.aspx dies after DEBUG output created by Wget 1.9-beta on Windows. [...] Herold, I'm currently having problems obtaining a working SSL build, so I'll need your help with this. Notice that the above change in fact consists of two changes: one to `retr.c', and the other to `connect.c'. Please try to figure out which one is responsible for the crash. Then we'll have a better idea of what to look for.
Re: Handling of Content-Length 0
Stefan Eissing [EMAIL PROTECTED] writes: Please excuse if this bug has already been reported: In wget 1.8.1 (OS X) and 1.8.2 (cygwin) the handling of resources with content-length 0 is wrong. wget tries to read the empty content and hangs until the socket read timeout fires. (I set the timeout to different values and it exactly matches the termination of the GET). Of course this is only noticeable with HTTP/1.1 servers which leave the connection open and do not apply transfer-encoding: chunked for empty response bodies. I've now examined the source code, and I believe Wget handles this case correctly: if keep-alive is in use, it reads only as much data as specified by Content-Length, attempting no read if content-length is 0. The one case I can see that might go wrong is that a server leaves a connection hanging without having told Wget it was about to do so. Then Wget will, being an HTTP/1.0 client, try to read all data from the socket regardless of Content-Length. Do you have a URL which you use to repeat this?
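To make the rule concrete, here is a rough sketch of the reading behavior described above. It is illustrative C, not Wget's actual get_contents() code.

  #include <stdio.h>
  #include <unistd.h>

  static int
  read_body (int fd, FILE *out, long content_length, int keep_alive)
  {
    char buf[4096];
    long remaining = content_length;
    int to_read, nread;

    /* With keep-alive, a zero Content-Length means the loop is never
       entered and no read is attempted; without keep-alive we read
       until the server closes the connection (HTTP/1.0 style).  */
    while (!keep_alive || remaining > 0)
      {
        to_read = sizeof buf;
        if (keep_alive && remaining < to_read)
          to_read = remaining;
        nread = read (fd, buf, to_read);
        if (nread <= 0)
          return nread;             /* 0 = EOF, -1 = read error */
        fwrite (buf, 1, nread, out);
        if (keep_alive)
          remaining -= nread;
      }
    return 0;
  }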
Re: minor problem with @xref in documentation
Noèl Köthe [EMAIL PROTECTED] writes: at the end of the description of the option --http-passwd=password: "For more information about security issues with Wget," The sentence is incomplete. wget.texi shows: "For more information about security issues with Wget, @xref{Security Considerations}." The info page has a correct link. Any idea how to fix this for the manpage? Maybe we should hack texi2pod to change @xref{(.*)} to "see the node \1 of the Info documentation"?
Re: windows compile error
Herold Heiko [EMAIL PROTECTED] writes: Found it. Using the 23:00 connect.c and the 23:59 retr.c does produce the bug. Using the 23:59 connect.c and the 23:00 retr.c works ok. This means the problem must be in retr.c . OK, that narrows it down. Two further questions: 1) If you comment out lines 180 and 181 of retr.c, does the problem go away? 1a) How about if you replace line 181 with `dlbufsize = sizeof(dlbuf)'? 2) Do you even specify --limit-rate? If so, to what size?
Re: windows compile error
I noticed the mistake as soon as I compiled with SSL (and saw the warnings):

2003-09-18  Hrvoje Niksic  [EMAIL PROTECTED]

        * retr.c (get_contents): Pass the correct argument to ssl_iread.

Index: src/retr.c
===
RCS file: /pack/anoncvs/wget/src/retr.c,v
retrieving revision 1.57
diff -u -r1.57 retr.c
--- src/retr.c 2003/09/15 21:48:43 1.57
+++ src/retr.c 2003/09/18 11:41:56
@@ -191,7 +191,7 @@
                        ? MIN (expected - *len, dlbufsize) : dlbufsize);
 #ifdef HAVE_SSL
       if (rbuf->ssl != NULL)
-        res = ssl_iread (rbuf->ssl, dlbufsize, amount_to_read);
+        res = ssl_iread (rbuf->ssl, dlbuf, amount_to_read);
       else
 #endif /* HAVE_SSL */
         res = iread (fd, dlbuf, amount_to_read);
Re: protocols directories ?
Herold Heiko [EMAIL PROTECTED] writes: Solution 1: have a switch like --use-protocol-dir = [no|most|all] no would be the current state: ./www.some.site/index.html ./www.some.site/index.html ./www.some.site/index.html all would be: always add a directory level for the protocol: ./http/www.some.site/index.html ./https/www.some.site/index.html ./ftp/www.some.site/index.html That sounds like a good suggestion, except I'd personally go for a simple yes/no. People who don't need it will never use it, and people who do need it won't mind the "all" semantics (I think). Plus, *plug*, in the new code, it's dead easy to add. For example, in url_file_name, url.c:1691, you could write: if (opt.add_protocol_dir) append_string (scheme_name (u->scheme), fnres); Implementation of scheme_name is left as an exercise to the reader. :-)
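For completeness, one possible rendition of the scheme_name() exercise, assuming the SCHEME_* values from url.h; the function is hypothetical and untested against the tree.

  #include <stdlib.h>       /* for abort() */

  static const char *
  scheme_name (enum url_scheme scheme)
  {
    switch (scheme)
      {
      case SCHEME_HTTP:  return "http";
  #ifdef HAVE_SSL
      case SCHEME_HTTPS: return "https";
  #endif
      case SCHEME_FTP:   return "ftp";
      default:
        abort ();           /* an unknown scheme should never get here */
      }
  }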
Re: non-recursion
Ilya N. Golubev [EMAIL PROTECTED] writes: Duplicating my [EMAIL PROTECTED] sent on Wed, 10 Sep 2003 19:48:56 +0400 since mailer reports that [EMAIL PROTECTED] does not work. wget -mLd http://www.hro.org/docs/rlex/uk/index.htm does not follow `<A HREF="uk1.htm#1">' links contained in the resource. That's because Wget thinks those links are part of a huge comment that spans the better part of the document. Unlike most browsers, Wget implements a (too) strict comment parsing, which breaks pages that use non-SGML-compliant comments. As http://www.htmlhelp.com/reference/wilbur/misc/comment.html explains: [...] There is also the problem with the "--" sequence. Some people have a habit of using things like "<!------>" as separators in their source. Unfortunately, in most cases, the number of "-" characters is not a multiple of four. This means that a browser who tries to get it right will actually get it wrong here and actually hide the rest of the document. Currently the only workaround is to alter the source, e.g. by modifying advance_declaration() in html-parse.c. A future version of Wget will probably parse comments in a non-compliant fashion, by considering everything between "<!--" and "-->" to be a comment, which is what most other browsers have been doing since the beginnings of the web.
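For illustration, the lenient rule described above boils down to something like this; it is a sketch, not the actual html-parse.c change.

  /* P points just past "<!--" in a buffer that ends at END.  Return
     the position after the first "-->", or END if the comment is
     never terminated.  */
  static const char *
  skip_comment (const char *p, const char *end)
  {
    for (; p + 2 < end; p++)
      if (p[0] == '-' && p[1] == '-' && p[2] == '>')
        return p + 3;
    return end;
  }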
Re: wget renaming URL/file downloaded, how to???
Lucuk, Pete [EMAIL PROTECTED] writes: as we can see above, wget has raznoe.shtml.html as the main file, this is *not* what I want, I *always* want the main file to be named index.html. Wget doesn't really have the concept of a main file. As a workaround, you could simply `ln -s raznoe.shtml.html index.html', and index.html would point to your main file and be available on the web.
Re: non-recursion
Doug Kaufman [EMAIL PROTECTED] writes: On Thu, 18 Sep 2003, Hrvoje Niksic wrote: modifying advance_declaration() in html-parse.c. A future version of Wget will probably parse comments in a non-compliant fashion, by considering everything between "<!--" and "-->" to be a comment, which is what most other browsers have been doing since the beginnings of the web. The lynx browser is configurable as to how it parses comments. So is Wget, as of last night. The default is minimal (non-compliant) comment parsing, and that can be changed with `--strict-comments'. It can change on the fly from minimal comments to historical comments to valid comments. Which browsers act in non-compliant fashion all the time? Those that display http://www.hro.org/docs/rlex/uk/index.htm (unless I'm mistaken), and that would mean pretty much all of them. Of course, that page is but one example out of many. Some browsers have more complex heuristics for comment parsing, but adding that to Wget would probably be overdoing it.
Re: Read error (Success) in headers using wget and ssl
Dimitri Ars [EMAIL PROTECTED] writes: I'm having trouble connecting with wget to a site using SSL: [...] I can repeat this, but currently I don't understand enough about SSL to fix it. Christian, could you please help? wget https://145.222.135.165/index.htm --13:46:36-- https://145.222.135.165/index.htm => `index.htm' Connecting to 145.222.135.165:443... connected. HTTP request sent, awaiting response... Read error (Success) in headers. Retrying. --13:46:37-- https://145.222.135.165/index.htm (try: 2) => `index.htm' Connecting to 145.222.135.165:443... connected. HTTP request sent, awaiting response... Read error (Success) in headers. Retrying. --- Expected: "Unable to establish SSL connection." because it's using client certificates, but when using the client certificate the same error occurs, so this doesn't seem to be a client certificate problem, though it might be that wget is having trouble checking that it does need a client certificate?! Of course using IE as a browser (and the client certificate), no problem... Any idea how to fix this? I used wget 1.8.2 and a nightly cvs of 20030909, same problem. (Please reply directly too as I'm not on the list) Best regards, Dimitri
Re: Any comments on my feature requests ?
Sorry about the lack of response. Your feature requests are quite reasonable, but I have no idea of the timeframe when I'll work on them (they're not a priority for me). Perhaps someone else is interested in helping implement them. The things I planned to tackle for a post-1.9 release are compression support and proper password manager. BTW, have you tried `--http-user' and `--http-passwd'? They're supposed to do pretty much what you describe.
Re: Any comments on my feature requests ?
Mark Veltzer [EMAIL PROTECTED] writes: On Monday 22 September 2003 00:20, you wrote: Sorry about the lack of response. Your feature requests are quite reasonable, but I have no idea of the timeframe when I'll work on them (they're not a priority for me). Perhaps someone else is interested in helping implement them. The things I planned to tackle for a post-1.9 release are compression support and proper password manager. BTW, have you tried `--http-user' and `--http-passwd'? They're supposed to do pretty much what you describe. That's weird. I tried --http-user and --http-passwd and all is working well. According to the documentation the following are equivalent: wget -r --http-user=foo --http-passwd=bar http://my.org and wget -r http://foo:[EMAIL PROTECTED] But they are not. Version 1 works while version 2 doesn't?!? Does the manual really say that they are equivalent? When you specify `--http-user' and `--http-passwd', they are used for *all* the downloads. When you specify the username and password in a URL, they are used for that URL and not others. That can be considered a bug, but that's how it is.
Re: Any comments on my feature requests ?
Mark Veltzer [EMAIL PROTECTED] writes: In addition I would add a flag that makes the URL method work like the explicit method and vice versa. This would cover all bases. The semantics of that flag aren't as obvious as they may seem. For example, it's completely legal to do this: wget -r http://user1:[EMAIL PROTECTED]/foo/ http://user2:[EMAIL PROTECTED]/bar/
Portable representation of large integers
In these enlightened times when 2G+ or "large" files are no longer considered large even in the third world, more and more people ask for the ability to download huge files with Wget. Wget carefully uses `long' for potentially large values, such as file sizes and offsets, but that has no effect on the most popular 32-bit architectures, where `long' and `int' are both 32-bit quantities. (It does help on 16-bit architectures where `int' is 16-bit, and it helps under 64-bit LP64 environments where int is 32-bit, but `long' and `long long' are 64-bit.)

There have been several attempts to fix this:

* The hack called VERY_LONG_TYPE is used to store values that can reasonably be larger than 2G, such as the sum of all downloads. However, on machines without `long long', VERY_LONG_TYPE will be long. Since it is not used for anything critical, that's not much of a problem (and Wget is careful to detect overflows when adding to the sum, so bogus values are not printed.)

* SuSE incorporated patches that change Wget's use of `long' to `unsigned long', which upgraded the limit from 2G to 4G. Aside from all the awkwardness that comes from unsigned arithmetic (checking for error conditions with x<0 doesn't work; you have to use x==-1), its effect is limited: if I want to download a 3G file today, I'll want to download a 5G file tomorrow.

* In its own patches, Debian introduced the use of large file APIs and `long long'. While that's perfectly fine for Debian, it is not portable. Neither the large file API nor `long long' are universally available, and both need thorough configure checking.

I believe that large numbers and large files are orthogonal. We need a large numeric type to represent numbers that *could* be large, be it the sum of downloaded bytes, remote file sizes, or local file sizes or offsets. Independently, we need to use the large file API where available, to be able to write and read large files locally.

Of those two issues, choosing and using the numeric type is the hard one. Autoconf helps only to an extent -- even if you define your own `large_number_t' typedef, which is either `long' or `long long', the question remains how to print that number. Even worse, some systems have `long long' (because they use gcc), but don't support it in libc, so printf can't print it. One way to solve this is to define macros for printing types. For example:

  #ifdef HAVE_LONG_LONG
  typedef long long large_number_t;
  # define LN_PRINT "lld"
  #else
  typedef double large_number_t;
  # define LN_PRINT "f"
  #endif

Then this becomes legal code:

  large_number_t num = 0;
  printf ("The number is: %" LN_PRINT "!\n", num);

Aside from being butt-ugly, this code has two serious problems.

1. Concatenation of adjacent string literals is an ANSI feature and would break pre-ANSI compilers.

2. It breaks gettext. With translation support, the above code would look like this:

  large_number_t num = 0;
  printf (_("The number is: %" LN_PRINT "!\n"), num);

The message snarfer won't be able to process this because it expects a string literal inside _(...). Even if it were taught about string concatenation, it wouldn't know what to replace LN_PRINT with, unless it ran the preprocessor. And if it ran the preprocessor, it would get non-portable results ("lld" or "f") which cannot be stored to the message catalog.

The bottom line is, I really don't know how to solve this portably. Does anyone know how widely ported software deals with large files?
Re: Portable representation of large integers
Maciej W. Rozycki [EMAIL PROTECTED] writes: On Mon, 22 Sep 2003, Hrvoje Niksic wrote: Well, using off_t and AC_SYS_LARGEFILE seems to be the recommended practice. Recommended for POSIX systems, perhaps, but not really portable to older machines. And it doesn't solve the portable printing problem either, so in effect it's about as portable as unconditionally using `long long', which is mandated by C99. I doubt any system that does not support off_t does support LFS. As I mentioned in the first message, LFS is not the only thing you need large values for. Think download quota or the sum of downloaded bytes. You should be able to specify `--quota=10G' on systems without LFS. As for the hassle, remember that Wget caters to systems with far fewer features than LFS on a regular basis. For example, we support pre-ANSI C compilers, libc's without snprintf, strptime or, for that matter, basic C89 functions like memcpy or strstr. So yes, I'd say pre-LFS systems are worth the hassle. Perhaps a good compromise would be to use off_t for variables whose 64-bitness doesn't matter without LFS, and a `large_number_t' typedef that points to either `double' or `long long' for others. Since the others are quite rare, printing them won't be a problem in practice, just like it's not for VERY_LONG_TYPE right now. And even if it does become a problem, it's probably not worth the hassle. To handle ordinary old systems, you just call: AC_CHECK_TYPE(off_t, long) before calling AC_SYS_LARGEFILE. That still doesn't explain how to print off_t on systems that don't natively support it. (Or that do, for that matter.)
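For the printing question specifically, one portable dodge -- just a sketch under the assumption that file sizes stay below 2^53, not a description of what Wget does -- is never to pass off_t to printf directly but to cast it to double, whose %.0f specifier every C89 printf understands:

    #include <stdio.h>
    #include <sys/types.h>

    /* Print an off_t without knowing how wide it is on this system.
       A double represents integers exactly up to 2^53, which covers
       any realistic file size, and "%.0f" is plain C89. */
    static void
    report_size (const char *file, off_t size)
    {
      printf ("%s: %.0f bytes\n", file, (double) size);
    }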
Re: Please remove me from this alias
Note that this is not an alias, it's a mailing list you must have subscribed to before. (We're not in the spam business just yet, despite certain unfortunate events in the past.) To unsubscribe, please send mail to [EMAIL PROTECTED].
Re: Portable representation of large integers
Daniel Stenberg [EMAIL PROTECTED] writes: On Mon, 22 Sep 2003, Hrvoje Niksic wrote: The bottom line is, I really don't know how to solve this portably. Does anyone know how widely ported software deals with large files? In curl, we provide our own *printf() code that works as expected on all platforms. Lovely. :-) Wget does come with a printf implementation, but it's used only on systems that don't have snprintf, and I'd kind of like it to stay that way. This is one of those wheels that are not that much fun to reinvent. (But then again, I thought exactly the same about hash tables, and I ended up having to roll my own.) (Not that we have proper 2GB support yet anyway, but that's another story. For example, we have to face the problems with exposing an API using such a variable type...) Ah, the joys of writing a library...
Re: Portable representation of large integers
DervishD [EMAIL PROTECTED] writes: Yes, you're right, but... How about using C99 large integer types (intmax_t and family)? But then I can use `long long' just as well, which is supported by C99 and (I think) required to be at least 64 bits wide. Portability is the whole problem, so suggestions that throw portability out the window aren't telling me anything new. Using #ifdefs to switch between %d/%lld/%j *is* completely portable, but it requires three translations for each message. The translators would feast on my flesh, howling at the moonlight.

Hmm. How about preprocessing the formats before passing them to printf? For example, always use %j in strings, like this:

    printf (FORMAT (_("whatever %j\n")), num);

On systems that support %j, FORMAT would be defined to a no-op. Otherwise, it would be defined to a format_transform function that converts %j to either %lld or %.0f, depending on whether the system has long long or not (in which case it would use double for large quantities).

That's the best I can get, because when I write portable code, by portable I understand 'according to standards'. For me that means, in that order: SUSv3, POSIX, C99, C89, stop. No pre-ANSI and no brain-damaged compilers. I understand your position -- it's perfectly valid, especially when you have the privilege of working on a system that supports all those standards well. But many people don't, and Wget (along with most GNU software of the era) was written to work for them as well. I don't want to support only POSIX systems for the same reason I don't want to support only the GNU system or only the Microsoft systems. For me, portability is not about adhering to standards, it's about making programs work in a wide range of environments, some of which differ from yours. Thanks for your suggestions.
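To make that idea concrete, a rough sketch of such a FORMAT/format_transform pair is shown below. The names and the HAVE_* macros are hypothetical, and unlike the no-op variant described above this version always rewrites the placeholder (strict C99 spells the specifier %jd, not %j); it is only an illustration of the approach, not existing Wget code.

    #include <stdio.h>
    #include <string.h>

    /* Message strings contain the placeholder "%j"; format_transform
       rewrites it into a conversion the local printf can digest.  The
       caller must pass the matching argument type (intmax_t, long long
       or double) for the branch that was compiled in.  Uses a static
       buffer, so it is not thread-safe -- a sketch only. */

    #if defined HAVE_PRINTF_J
    # define LARGE_SPEC "%jd"
    #elif defined HAVE_LONG_LONG
    # define LARGE_SPEC "%lld"
    #else
    # define LARGE_SPEC "%.0f"
    #endif

    #define FORMAT(fmt) format_transform (fmt)

    static const char *
    format_transform (const char *fmt)
    {
      static char buf[256];
      char *out = buf;

      while (*fmt && out < buf + sizeof buf - sizeof LARGE_SPEC)
        {
          if (fmt[0] == '%' && fmt[1] == 'j')
            {
              memcpy (out, LARGE_SPEC, strlen (LARGE_SPEC));
              out += strlen (LARGE_SPEC);
              fmt += 2;
            }
          else
            *out++ = *fmt++;
        }
      *out = '\0';
      return buf;
    }

A call would then look like printf (FORMAT (_("Downloaded %j bytes\n")), total), and the message catalogs only ever see the %j form.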
Re: bug maybe?
Randy Paries [EMAIL PROTECTED] writes: Not sure if this is a bug or not. I guess it could be called a bug, although it's no simple oversight. Wget currently doesn't support large files.
Wget 1.9-beta1 is available for testing
After a long time sitting in CVS, a beta of Wget 1.9 is available. To see what's new since 1.8, check the `NEWS' file in the distribution. Get it from: http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta1.tar.gz Please test it on as many different platforms as possible and in the places where Wget 1.8.x is currently being used. I expect this release to be extremely stable, but no one can guarantee that without wider testing. I didn't want to call it pre1 or rc1 lest I anger the Gods. One important addition scheduled for 1.9 and *not* featured in this beta is Mauro's IPv6 work. When I receive and merge Mauro's changes, I'll release a new beta. As always, thanks for your help.
Re: unsubscribe me now otherwise messages will bounce back to you
To unsubscribe, send mail to [EMAIL PROTECTED].
Re: Wget 1.9-beta1 is available for testing
DervishD [EMAIL PROTECTED] writes: I've got and tested it, and with NO wgetrc (it happens the same with my own wgetrc, but I tested clean just in case), the problem with the quoting still exists:

$ wget -r -c -nH ftp://user:[EMAIL PROTECTED]/Music/Joe Hisaishi
[...]
--15:22:55--  ftp://user:[EMAIL PROTECTED]/Music%2fJoe%20Hisaishi/Joe%20Hisaishi
           => `Music%2FJoe Hisaishi/.listing'

Thanks for the detailed bug report. Although it doesn't look that way, this problem is nothing but a simple oversight. (A function that was supposed to URL-encode everything except slashes failed to enforce the exception.) This patch should fix it:

2003-09-24  Hrvoje Niksic  [EMAIL PROTECTED]

	* url.c (url_escape_1): Revert unintentional change to lowercase
	xdigit escapes.
	(url_escape_dir): Document that this function depends on the
	output of url_escape_1.

Index: src/url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.94
diff -u -r1.94 url.c
--- src/url.c	2003/09/22 12:07:20	1.94
+++ src/url.c	2003/09/24 14:10:48
@@ -198,8 +198,8 @@
 	{
 	  unsigned char c = *p1++;
 	  *p2++ = '%';
-	  *p2++ = XNUM_TO_digit (c >> 4);
-	  *p2++ = XNUM_TO_digit (c & 0xf);
+	  *p2++ = XNUM_TO_DIGIT (c >> 4);
+	  *p2++ = XNUM_TO_DIGIT (c & 0xf);
 	}
       else
 	*p2++ = *p1++;
@@ -1130,6 +1130,7 @@
   for (; *h; h++, t++)
     {
+      /* Depend on url_escape_1 having converted '/' to "%2F" exactly. */
       if (*h == '%' && h[1] == '2' && h[2] == 'F')
 	{
 	  *t = '/';
Re: Wget 1.9-beta1 is available for testing
Could the person who sent me the patch for Windows compiler support please resend it? Amidst all the viruses, I accidentally deleted the message before I had a chance to apply it. Sorry about the mistake.
Re: wget bug
Jack Pavlovsky [EMAIL PROTECTED] writes: It's probably a bug: bug: when downloading wget -mirror ftp://somehost.org/somepath/3acv14~anivcd.mpg, wget saves it as-is, but when downloading wget ftp://somehost.org/somepath/3*, wget saves the files as 3acv14%7Eanivcd.mpg Thanks for the report. The problem here is that Wget tries to be helpful by encoding unsafe characters in file names to %XX, as is done in URLs. Your first example works because of an oversight (!) that actually made Wget behave as you expected. The good news is that the helpfulness has been rethought for the next release and is no longer there, at least not for ordinary characters like ~ and . Try getting the latest CVS sources, they should work better in this regard. (http://wget.sunsite.dk/ explains how to download the source from CVS.)
Re: Windows patches
Thanks for the patch, I've now applied it using the following ChangeLog entry: 2003-09-26 Gisle Vanem [EMAIL PROTECTED] * mswindows.c (read_registry): Removed. (set_sleep_mode): New function. (windows_main_junk): Call it. BTW, unless you want your patch to be reviewed by a wider audience, you might want to send the patch to [EMAIL PROTECTED] instead. This, as well as the ChangeLog policy and some other things, is explained in the PATCHES document at the top level of Wget's distribution.
Re: dificulty with Debian wget bug 137989 patch
jayme [EMAIL PROTECTED] writes: [...] Before anything else, note that the patch originally written for 1.8.2 will need changes for 1.9. The changes are not hard to make, but they're still needed. The patch didn't make it to the canonical sources because it assumes `long long', which is not available on many platforms that Wget supports. The issue will likely be addressed in 1.10. Having said that: I tried the patch from Debian bug report 137989 and it didn't work. Can anybody explain: 1 - why do I have to make two directories for the patch to work: one wget-1.8.2.orig and one wget-1.8.2? You don't. Just enter Wget's source and type `patch -p1 < patchfile'. `-p1' makes sure that the top-level directories, such as wget-1.8.2.orig and wget-1.8.2, are stripped when finding files to patch. 2 - why after compilation wget still can't download a file > 2GB? I suspect you've tried to apply the patch to Wget 1.9-beta, which doesn't work, as explained above.
Wget 1.9-beta2 is available for testing
This beta includes several important bug fixes since 1.9-beta1, most notably the fix for correct file name quoting with recursive FTP downloads. Important Windows fixes by Gisle Vanem and Herold Heiko are also present. Get it from: http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta2.tar.gz
Re: Option to save unfollowed links
[ Added Cc to [EMAIL PROTECTED] ] Tony Lewis [EMAIL PROTECTED] writes: The following patch adds a command line option to save any links that are not followed by wget. For example: wget http://www.mysite.com --mirror --unfollowed-links=mysite.links will result in mysite.links containing all URLs that are references to other sites in links on mysite.com. I'm curious: what is the use case for this? Why would you want to save the unfollowed links to an external file?
Submitting a `.pot' file to the Translation Project
Does anyone know the current procedure for submitting the `.pot' file to the GNU Translation Project? At the moment, the project home page at http://www.iro.umontreal.ca/contrib/po/HTML/ appears dead.
Re: Option to save unfollowed links
Tony Lewis [EMAIL PROTECTED] writes: Hrvoje Niksic wrote: I'm curious: what is the use case for this? Why would you want to save the unfollowed links to an external file? I use this to determine what other websites a given website refers to. For example: wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ --mirror -np --unfollowed-links=hayward.out By looking at hayward.out, I have a list of all websites that the directory refers to. When I use this file, I sort it and throw away the Google and DMOZ links. Everything else is supposed to be something interesting about Hayward. I see. Hmm... if you have to post-process the list anyway, wouldn't it be more useful to have a list of *all* encountered URLs? It might be nice to accompany this output with the exit statuses, so people can easily grep for 404's. A comprehensive reporting facility has often been requested. Perhaps something should be done about it for the next release.
Re: Option to save unfollowed links
Tony Lewis [EMAIL PROTECTED] writes: Would something like the following be what you had in mind? 301 http://www.mysite.com/ 200 http://www.mysite.com/index.html 200 http://www.mysite.com/followed.html 401 http://www.mysite.com/needpw.html --- http://www.othersite.com/notfollowed.html Yes, with the possible extensions of file name where the link was saved, sensible status for non-HTTP (currently FTP) links, etc.
Re: downloading files for ftp
Payal Rathod [EMAIL PROTECTED] writes: I have 5-7 user accounts in /home whose data is important. Every day at 12:00 I want to back up their data to a different backup machine. The remote machine has an ftp server. Can I use wget for this? If yes, how do I proceed? The way to do it with Wget would be something like: wget --mirror --no-host-directories ftp://username:[EMAIL PROTECTED] It will preserve permissions. Having said that, I believe that rsync would be better at this because it's much more careful to correctly transfer a directory tree from point A to point B. (For better transfer of file names, you should also use Wget 1.9 beta and specify `--restrict-file-names=nocontrol'.)
Wget 1.9-beta3 is available for testing
Not many changes from the previous beta. This one is for the purposes of the Translation Project, to which I've submitted `wget.pot', and which might wonder where to get the wget-1.9-beta3 source from. Get it from: http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta3.tar.gz Mauro's IPv6 changes are not in this beta, and they might not make it into 1.9.
Re: downloading files for ftp
Payal Rathod [EMAIL PROTECTED] writes: On Wed, Oct 01, 2003 at 09:26:47PM +0200, Hrvoje Niksic wrote: The way to do it with Wget would be something like: wget --mirror --no-host-directories ftp://username:[EMAIL PROTECTED] But if I run in thru' crontab, where will it store the downloaded files? I want it to store as it is in server 1. It will store them to the current directory. You can either cd to the desired target directory, or use the `-P' flag to specify the directory to Wget.
Re: BUG in --timeout (exit status)
This problem is not specific to timeouts, but to recursive download (-r). When downloading recursively, Wget expects some of the specified downloads to fail and does not propagate that failure to the code that sets the exit status. This unfortunately includes the first download, which should probably be an exception.
Re: Submitting a `.pot' file to the Translation Project
The home page is back, but it says that the TP Robot is dead. I've contacted Martin Loewis, perhaps he'll be able to provide more info.
Re: downloading files for ftp
Payal Rathod [EMAIL PROTECTED] writes: On Thu, Oct 02, 2003 at 12:03:34PM +0200, Hrvoje Niksic wrote: Payal Rathod [EMAIL PROTECTED] writes: On Wed, Oct 01, 2003 at 09:26:47PM +0200, Hrvoje Niksic wrote: The way to do it with Wget would be something like: wget --mirror --no-host-directories ftp://username:[EMAIL PROTECTED] But if I run in thru' crontab, where will it store the downloaded files? I want it to store as it is in server 1. It will store them to the current directory. You can either cd to the desired target directory, or use the `-P' flag to specify the directory to Wget. Thanks a lot. It works wonderfully. But one small thing here. I am trying to use it thru' cron like this, 51 * * * * wget --mirror --no-host-directories -P /home/t1 ftp://root:[EMAIL PROTECTED]//home/payal/qmail* But instead of delivering it to /home/t1, wget makes a directory /home/t1/home/payal and put the qmail* files there. What is the workaround for this? Use `--cut-dirs=2', which will tell Wget to get rid of two levels of directory hierarchy (home and payal).
Re: run_with_timeout() for Windows
Gisle Vanem [EMAIL PROTECTED] writes: I've patched utils.c to make run_with_timeout() work on Windows (better than it does with alarm()!). Cool, thanks! Note that, to save the honor of Unix, I've added support for setitimer on systems that support it (virtually everything these days), so run_with_timeout now always works with sub-second precision. Also, I think the Windows-specific implementation of run_with_timeout should be entirely in mswindows.c. The Unix one in utils.c is enough of a soup without adding the Windows version as well. Besides, mswindows.c can freely include all the needed headers, use MSVC++ specific constructs, etc.

In short it creates and starts a thread, then loops querying the thread exit-code; breaks if != STILL_ACTIVE, else sleeps for 0.1 sec. Uses a wget_timer too for added accuracy.

The 0.1s sleeps strike me as inefficient. Couldn't you wait for a condition instead? For example:

    run_with_timeout(...)
    {
      initialize condvar (pthread_cond_init)
      spawn the thread
      wait on condvar's condition with specified timeout (pthread_cond_timedwait)
      kill the thread or not, depending on whether the above wait timed out or not.
    }

    thread_helper()
    {
      call fun(arg)
      signal the condvar (pthread_cond_signal)
    }

I have a problem with run_with_timeout() returning 1 and hence lookup_host() reporting ETIMEDOUT. Isn't TRY_AGAIN more suited to indicating that the caller should try a longer timeout? I'm not sure what you mean here. Isn't the whole point of having a DNS timeout for the program to *not* retry with a longer value, but to give up? Or, do you mean that Wget's *_loop functions should treat host lookup failure due to timeout as a non-fatal error?

+  if (seconds < 1.0)
+    seconds = 1.0;

Why is this necessary? The alarm() code was doing something similar, but that was to make sure a 0.5s timeout doesn't end up calling alarm(0), which would mean wait forever.

BTW why are you setting the stack size to 4096 (bytes?)? It probably doesn't matter in the current implementation, but it might hurt other uses of run_with_timeout.

+  /* If we timed out kill the thread.  Normal thread exitCode would be 0.
+   */
+  if (exitCode == STILL_ACTIVE)
+    {
+      DEBUGN (2, ("thread timed out\n"));
+      exitCode = 1;
+      TerminateThread (thread_hnd, exitCode);
+      WSASetLastError (ETIMEDOUT); /* overridden by caller */

Why are you setting the error here? The semantics of run_with_timeout are supposed to be that error conditions are determined by whatever FUN was doing. If some X_with_timeout routine wants to set errno to ETIMEDOUT, it can, but it's not run_with_timeout's job to do that.
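For reference, a fleshed-out version of that pseudocode using POSIX threads might look roughly like the following. It is only a sketch of the idea under discussion (the committed Windows code uses CreateThread and lives in mswindows.c); the structure and helper names here are made up, and thread cancellation has its own caveats that a real implementation would have to deal with.

    #include <errno.h>
    #include <pthread.h>
    #include <sys/time.h>
    #include <time.h>

    struct thread_arg {
      void (*fun) (void *);
      void *arg;
      pthread_mutex_t mutex;
      pthread_cond_t cond;
      int finished;
    };

    static void *
    thread_helper (void *p)
    {
      struct thread_arg *ta = p;
      ta->fun (ta->arg);                   /* do the real work */
      pthread_mutex_lock (&ta->mutex);
      ta->finished = 1;
      pthread_cond_signal (&ta->cond);     /* wake the waiting caller */
      pthread_mutex_unlock (&ta->mutex);
      return NULL;
    }

    /* Run FUN(ARG), giving up after TIMEOUT seconds.  Returns non-zero
       if the timeout expired, zero if FUN finished in time. */
    int
    run_with_timeout (double timeout, void (*fun) (void *), void *arg)
    {
      struct thread_arg ta;
      pthread_t thread;
      struct timeval now;
      struct timespec abstime;
      int timed_out = 0;

      ta.fun = fun;
      ta.arg = arg;
      ta.finished = 0;
      pthread_mutex_init (&ta.mutex, NULL);
      pthread_cond_init (&ta.cond, NULL);

      /* Absolute deadline for pthread_cond_timedwait. */
      gettimeofday (&now, NULL);
      abstime.tv_sec = now.tv_sec + (time_t) timeout;
      abstime.tv_nsec = now.tv_usec * 1000
                        + (long) ((timeout - (time_t) timeout) * 1e9);
      if (abstime.tv_nsec >= 1000000000L)
        {
          abstime.tv_sec++;
          abstime.tv_nsec -= 1000000000L;
        }

      if (pthread_create (&thread, NULL, thread_helper, &ta) != 0)
        {
          fun (arg);                       /* can't spawn a thread: just run it */
          return 0;
        }

      pthread_mutex_lock (&ta.mutex);
      while (!ta.finished && !timed_out)
        if (pthread_cond_timedwait (&ta.cond, &ta.mutex, &abstime) == ETIMEDOUT)
          timed_out = 1;
      pthread_mutex_unlock (&ta.mutex);

      if (timed_out)
        pthread_cancel (thread);           /* best effort; see caveat above */
      pthread_join (thread, NULL);

      pthread_mutex_destroy (&ta.mutex);
      pthread_cond_destroy (&ta.cond);
      return timed_out;
    }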
Re: run_with_timeout() for Windows
I've committed this patch, with minor changes, such as moving the code to mswindows.c. Since I don't have MSVC, someone else will need to check that the code compiles. Please let me know how it goes.
Re: wget 1.9 - behaviour change in recursive downloads
It's a feature. `-A zip' means `-A zip', not `-A zip,html'. Wget downloads the HTML files only because it absolutely has to, in order to recurse through them. After it finds the links in them, it deletes them.
Re: some wget patches against beta3
Thanks for the contribution. Note that a slightly more correct place to send the patch is the [EMAIL PROTECTED] list, followed by people with a keener interest in development. Also, you should send at least a short explanation of what each patch is supposed to do and why one should apply it. (Except in the case of really short, self-explanatory patches, of course.) As for the Polish translation, translations are normally handled through the Translation Project. The TP robot is currently down, but I assume it will be back up soon, and then we'll submit the POT file and update the translations /en masse/.
Re: mswindows.h patch
Thanks for the patch, I've now applied it with the following ChangeLog entry:

2003-10-03  Gisle Vanem  [EMAIL PROTECTED]

	* mswindows.h: Include winsock headers here.
	* connect.c: And don't include them here.

However, I've postponed applying the part that changes `-d'. I agree that `-d' could stand improvement, but let's wait with that until 1.9 is released.
Re: wget 1.9 - behaviour change in recursive downloads
Jochen Roderburg [EMAIL PROTECTED] writes: Zitat von Hrvoje Niksic [EMAIL PROTECTED]: It's a feature. `-A zip' means `-A zip', not `-A zip,html'. Wget downloads the HTML files only because it absolutely has to, in order to recurse through them. After it finds the links in them, it deletes them. Hmm, so it has really been an undetected error over all the years ;-) ? s/undetected/unfixed/ At least I've always considered it an error. I didn't know people depended on it.
Re: run_with_timeout() for Windows
Gisle Vanem [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] said: I've committed this patch, with minor changes, such as moving the code to mswindows.c. Since I don't have MSVC, someone else will need to check that the code compiles. Please let me know how it goes. It compiled with MSVC okay, but crashed somewhere unrelated. Both before and after my patch. In which code does it crash? Is the crash repeatable? If so, how do you repeat it? Can you see if the same crash occurs in the beta1 or beta2 codebase? Thanks.
Re: Bug in Windows binary?
Gisle Vanem [EMAIL PROTECTED] writes:

--- mswindows.c.org	Mon Sep 29 11:46:06 2003
+++ mswindows.c	Sun Oct 05 17:34:48 2003
@@ -306,7 +306,7 @@
 DWORD set_sleep_mode (DWORD mode)
 {
   HMODULE mod = LoadLibrary ("kernel32.dll");
-  DWORD (*_SetThreadExecutionState) (DWORD) = NULL;
+  DWORD (WINAPI *_SetThreadExecutionState) (DWORD) = NULL;
   DWORD rc = (DWORD)-1;

I assume Heiko didn't notice it because he doesn't have that function in his kernel32.dll. Heiko and Hrvoje, will you correct this ASAP? I've now applied the patch, thanks. I use the following ChangeLog entry:

2003-10-05  Gisle Vanem  [EMAIL PROTECTED]

	* mswindows.c (set_sleep_mode): Fix type of
	_SetThreadExecutionState.
Re: subscribe wget
To subscribe to this list, please send mail to [EMAIL PROTECTED].
Re: can wget disable HTTP Location Forward ?
There is currently no way to disable following redirects. A patch to do so has been submitted recently, but I didn't see a good reason why one would need it, so I didn't add the option. Your mail is a good argument, but I don't know how prevalent that behavior is. What is it with servers that can't be bothered to return 404? Are there lots of them nowadays? Is a new default setting of Apache or IIS to blame, or are people intentionally screwing up their configurations?
Re: Web page source using wget?
Tony Lewis [EMAIL PROTECTED] writes: wget http://www.custsite.com/some/page.html --http-user=USER --http-passwd=PASS If you supply your user ID and password via a web form, it will be tricky (if not impossible) because wget doesn't POST forms (unless someone added that option while I wasn't looking. :-) Wget 1.9 can send POST data. But there's a simpler way to handle web sites that use cookies for authorization: make Wget use the site's own cookie. Export cookies as explained in the manual, and specify:

    wget --load-cookies=COOKIE-FILE http://...

Here is an excerpt from the manual section that explains how to export cookies.

`--load-cookies FILE'
     Load cookies from FILE before the first HTTP retrieval.  FILE is a
     textual file in the format originally used by Netscape's
     `cookies.txt' file.

     You will typically use this option when mirroring sites that
     require that you be logged in to access some or all of their
     content.  The login process typically works by the web server
     issuing an HTTP cookie upon receiving and verifying your
     credentials.  The cookie is then resent by the browser when
     accessing that part of the site, and so proves your identity.

     Mirroring such a site requires Wget to send the same cookies your
     browser sends when communicating with the site.  This is achieved
     by `--load-cookies'--simply point Wget to the location of the
     `cookies.txt' file, and it will send the same cookies your browser
     would send in the same situation.  Different browsers keep textual
     cookie files in different locations:

     Netscape 4.x.
          The cookies are in `~/.netscape/cookies.txt'.

     Mozilla and Netscape 6.x.
          Mozilla's cookie file is also named `cookies.txt', located
          somewhere under `~/.mozilla', in the directory of your
          profile.  The full path usually ends up looking somewhat like
          `~/.mozilla/default/SOME-WEIRD-STRING/cookies.txt'.

     Internet Explorer.
          You can produce a cookie file Wget can use by using the File
          menu, Import and Export, Export Cookies.  This has been
          tested with Internet Explorer 5; it is not guaranteed to work
          with earlier versions.

     Other browsers.
          If you are using a different browser to create your cookies,
          `--load-cookies' will only work if you can locate or produce
          a cookie file in the Netscape format that Wget expects.

     If you cannot use `--load-cookies', there might still be an
     alternative.  If your browser supports a cookie manager, you can
     use it to view the cookies used when accessing the site you're
     mirroring.  Write down the name and value of the cookie, and
     manually instruct Wget to send those cookies, bypassing the
     official cookie support:

          wget --cookies=off --header "Cookie: NAME=VALUE"
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: Hello Everyone, I am new to this wget utility, so pardon my ignorance.. Here is a brief explanation of what I am currently doing: 1). I go to our customer's website every day & log in using a User Name & Password. 2). I click on 3 links before I get to the page I want. 3). I right-click on the page & choose view source. It opens it up in Notepad. 4). I save the source to a file & subsequently perform various tasks on that file. As you can see, it is a manual process. What I would like to do is automate this process of obtaining the source of a page using wget. Is this possible? Maybe you can give me some suggestions. It's possible; in fact, it's what Wget does in its most basic form. Disregarding authentication, the recipe would be: 1) Write down the URL. 2) Type `wget URL' and you get the source of the page in a file named SOMETHING.html, where SOMETHING is the file name that the URL ends with. Of course, you will also have to specify the credentials to the page, and Tony explained how to do that.
Wget 1.9-beta4 is available for testing
Several bugs fixed since beta3, including a fatal one on Windows. Includes a working Windows implementation of run_with_timeout. Get it from: http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta4.tar.gz
Re: -q and -S are incompatible
Dan Jacobson [EMAIL PROTECTED] writes: -q and -S are incompatible and should perhaps produce errors and be noted thus in the docs. They seem to work as I'd expect -- `-q' tells Wget to print *nothing*, and that's what happens. The output Wget would have generated does contain HTTP headers, among other things, but it never gets printed. BTW, there seems no way to get the -S output, but no progress indicator. -nv, -q kill them both. It's a bug that `-nv' kills `-S' output, I think. P.S. one shouldn't have to confirm each bug submission. Once should be enough. You're right. :-( I'll ask the sunsite people if there's a way to establish some form of white lists...
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] writes: As for the Polish translation, translations are normally handled through the Translation Project. The TP robot is currently down, but I assume it will be back up soon, and then we'll submit the POT file and update the translations /en masse/. It took a little bit longer than expected but now, the robot is up and running again. This morning (CET) I installed b3 for translation. However, http://www2.iro.umontreal.ca/~gnutra/registry.cgi?domain=wget still shows `wget-1.8.2.pot' to be the current template for [the] domain. Also, my Croatian translation of 1.9 doesn't seem to have made it in. Is that expected?
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Also, my Croatian translation of 1.9 doesn't seem to have made it in. Is that expected? Unfortunately, yes. Will you please resubmit it with the subject line updated (IIRC, it's now): TP-Robot wget-1.9-b3.hr.po I'm not sure what b3 is, but the version in the POT file was supposed to be beta3. Was there a misunderstanding somewhere along the line?
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] writes: I'm not sure what b3 is, but the version in the POT file was supposed to be beta3. Was there a misunderstanding somewhere along the line? Yes, the robot does not like beta3 as part of the version string. b3 or pre3 are okay. Ouch. Why does the robot care about version names at all?
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] writes: Ouch. Why does the robot care about version names at all? It must know about the sequences; this is important for merging issues. IIRC, we have at least these sequences supported by the robot:

    1.2 -> 1.2.1 -> 1.2.2 -> 1.3  etc.
    1.2 -> 1.2a -> 1.2b -> 1.3
    1.2 -> 1.3-pre1 -> 1.3-pre2 -> 1.3
    1.2 -> 1.3-b1 -> 1.3-b2 -> 1.3

Thanks for the clarification, Karl. But as a maintainer of a project that tries to use the robot, I must say that I'm not happy about this. If the robot absolutely must be able to collate versions, then it should be smarter about it and support a larger array of formats in use out there. See `dpkg' for an example of how it can be done, although the TP robot certainly doesn't need to do all that `dpkg' does. This way, unless I'm missing something, the robot seems to be in the position to dictate its very narrow-minded versioning scheme to the projects that would only like to use it (the robot). That's really bad. But what's even worse is that something or someone silently changed beta3 to b3 in the POT, and then failed to perform the same change for my translation, which caused it to get dropped without notice. Returning an error that says "your version number is unparsable to this piece of software, you must use one of ..." instead would be more correct in the long run. Is the robot written in Python? Would you consider it for inclusion if I donated a function that performed the comparison more fully (provided, of course, that the code meets your standards of quality)?
Re: Using chunked transfer for HTTP requests?
Tony Lewis [EMAIL PROTECTED] writes: Hrvoje Niksic wrote: Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to @code{--post-file} must be a regular file; specifying a FIFO or something like @file{/dev/stdin} won't work. There's nothing that says you have to read the data after you've started sending the POST. Why not just read the --post-file before constructing the request so that you know how big it is? I don't understand what you're proposing. Reading the whole file into memory is too memory-intensive for large files (one could presumably POST really huge files, CD images or whatever). What the current code does is: determine the file size, send Content-Length, read the file in chunks (up to the promised size) and send those chunks to the server. But that works only with regular files. It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin My first impulse was to bemoan Wget's antiquated HTTP code which doesn't understand chunked transfer. But, come to think of it, even if Wget used HTTP/1.1, I don't see how a client can send chunked requests and interoperate with HTTP/1.0 servers. How do browsers figure out whether they can do a chunked transfer or not? I haven't checked, but I'm 99% convinced that browsers simply don't give a shit about non-regular files.
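For reference, here is roughly what streaming a request body with HTTP/1.1 chunked transfer encoding involves on the wire. This is only a hedged sketch of the mechanism under discussion -- it is not Wget code, write_chunk and post_from_stdin are invented names, and it assumes a `Transfer-Encoding: chunked' header has already been sent to a server that actually speaks HTTP/1.1 (which is exactly the interoperability problem raised above). Partial writes are not handled here.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Send BUF of LEN bytes to descriptor SOCK as one HTTP/1.1 chunk:
       "<size in hex>\r\n" + data + "\r\n".  A zero-length chunk
       terminates the request body. */
    static int
    write_chunk (int sock, const char *buf, size_t len)
    {
      char head[32];
      int n = snprintf (head, sizeof head, "%zx\r\n", len);
      if (write (sock, head, n) < 0
          || (len > 0 && write (sock, buf, len) < 0)
          || write (sock, "\r\n", 2) < 0)
        return -1;
      return 0;
    }

    /* Stream stdin to SOCK without knowing its size in advance. */
    static int
    post_from_stdin (int sock)
    {
      char buf[8192];
      ssize_t nread;
      while ((nread = read (STDIN_FILENO, buf, sizeof buf)) > 0)
        if (write_chunk (sock, buf, (size_t) nread) < 0)
          return -1;
      return write_chunk (sock, "", 0);   /* final "0\r\n\r\n" terminator */
    }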
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: I guess, you as the wget maintainer switched from something supported to the unsupported betaX scheme and now we have something to talk about ;) I had no idea that something as common as betaX was unsupported. In fact, I believe that bX was added when Francois saw me using it in Wget. :-) Using something different than exactly wget-1.9-b3.de.po will confuse the robot sigh Returning an error that says your version number is unparsable to this piece of software, you must use one of ... instead would be more correct in the long run. Sure. You should have received a message like this, didn't you? I didn't. Maybe it was an artifact of the robot not having worked at the time, though.
Re: Using chunked transfer for HTTP requests?
Stefan Eissing [EMAIL PROTECTED] writes: On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote: What the current code does is: determine the file size, send Content-Length, read the file in chunks (up to the promised size) and send those chunks to the server. But that works only with regular files. It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin That would indeed be nice. Since I'm coming from the WebDAV side of life: does wget allow the use of PUT? No. I haven't checked, but I'm 99% convinced that browsers simply don't give a shit about non-regular files. That's probably true. But have you tried sending without Content-Length and Connection: close and closing the output side of the socket before starting to read the reply from the server? That might work, but it sounds too dangerous to do by default, and too obscure to devote a command-line option to. Besides, HTTP/1.1 *requires* requests with a request-body to provide Content-Length: For compatibility with HTTP/1.0 applications, HTTP/1.1 requests containing a message-body MUST include a valid Content-Length header field unless the server is known to be HTTP/1.1 compliant.
Re: [PATCH] wget-1.8.2: Portability, plus EBCDIC patch
Martin, thanks for the patch and the detailed report. Note that it might have made more sense to apply the patch to the latest CVS version, which is somewhat different from 1.8.2. I'm really not sure whether to add this patch. On the one hand, it's nice to support as many architectures as possible. But on the other hand, most systems are ASCII. All the systems I've ever seen or worked on have been ASCII. I am fairly certain that I would not be able to support EBCDIC in the long run and that, unless someone were to continually support EBCDIC, the existing support would bitrot away. Is anyone on the Wget list using an EBCDIC system?
Re: Using chunked transfer for HTTP requests?
Tony Lewis [EMAIL PROTECTED] writes: Hrvoje Niksic wrote: I don't understand what you're proposing. Reading the whole file into memory is too memory-intensive for large files (one could presumably POST really huge files, CD images or whatever). I was proposing that you read the file to determine the length, but that was on the assumption that you could read the input twice, which won't work with the example you proposed. In fact, it won't work with anything except regular files and links to them. Can you determine if --post-file is a regular file? Yes. If so, I still think you should just read (or otherwise examine) the file to determine the length. That's how --post-file works now. The problem is that it doesn't work for non-regular files. My first message explains it, or at least tries to. For other types of input, perhaps you want to write the input to a temporary file. That would work for short streaming, but would be pretty bad in the mkisofs example. One would expect Wget to be able to stream the data to the server, and that's just not possible if the size needs to be known in advance, which HTTP/1.0 requires.
Re: Major, and seemingly random problems with wget 1.8.2
Josh Brooks [EMAIL PROTECTED] writes: I have noticed very unpredictable behavior from wget 1.8.2 - specifically I have noticed two things: a) sometimes it does not follow all of the links it should b) sometimes wget will follow links to other sites and URLs - when the command line used should not allow it to do that. Thanks for the report. A more detailed response follows below:

First, sometimes when you attempt to download a site with -k -m (--convert-links and --mirror) wget will not follow all of the links and will skip some of the files! I have no idea why it does this with some sites and doesn't do it with other sites. Here is an example that I have reproduced on several systems - all with 1.8.2:

Links are missed on some sites because of the use of incorrect comments. This has been fixed for Wget 1.9, where a more relaxed comment parsing code is the default. But that's not the case for www.zorg.org/vsound/. www.zorg.org/vsound/ contains this markup:

    <META NAME="ROBOTS" CONTENT="NOFOLLOW">

That explicitly tells robots, such as Wget, not to follow the links in the page. Wget respects this and does not follow the links. You can tell Wget to ignore the robot directives. For me, this works as expected:

    wget -km -e robots=off http://www.zorg.org/vsound/

You can put `robots=off' in your .wgetrc and this problem will not bother you again.

The second problem, and I cannot currently give you an example to try yourself but _it does happen_, is if you use this command line:

    wget --tries=inf -nH --no-parent --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

At first it will act normally, just going over the site in question, but sometimes you will come back to the terminal and see it grabbing all sorts of pages from totally different sites (!)

The only way I've seen it happen is when it follows a redirection to a different site. The redirection is followed because it's considered to be part of the same download. However, further links on the redirected site are not (supposed to be) followed. If you have a repeatable example, please mail it here so we can examine it in more detail.
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: Thanks everyone for the replies so far.. The problem I am having is that the customer is using ASP and JavaScript. The URL stays the same as I click through the links. The URL staying the same is usually a sign of the use of frames, not of ASP and JavaScript. Instead of looking at the URL entry field, try using copy link to clipboard rather than clicking on the last link. Then use Wget on that.
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: this page contains a drop-down list of our customer's locations. At present, I choose one location from the drop-down list and click submit to get the data, which is displayed in a report format. I right-click, then choose view source and save the source to a file. I then choose the next location from the drop-down list and click submit again. I again do a view source and save the source to another file, and so on for all their locations.

It's possible to automate this, but it requires some knowledge of HTML. Basically, you need to look at the <form>...</form> part of the page and find the <select> tag that defines the drop-down. Assuming that the form looks like this:

    <form action="http://foo.com/customer" method=GET>
      <select name="location">
        <option value="ca">California
        <option value="ma">Massachusetts
        ...
      </select>
    </form>

you'd automate getting the locations by doing something like:

    for loc in ca ma ...
    do
      wget "http://foo.com/customer?location=$loc"
    done

Wget will save the respective sources in files named customer?location=ca, customer?location=ma, etc. But this was only an example. The actual process depends on what's in the form, and it might be considerably more complex than this.
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: It does look a little complicated. This is how it looks:

    <form action="InventoryStatus.asp" method="post">
    [...]
    [...]
    <select name="cboSupplier">
      <option value="4541-134289">454A</option>
      <option value="4542-134289" selected>454B</option>
    </select>

Those are the important parts. It's not hard to submit this form. With Wget 1.9, you can even use the POST method, e.g.:

    wget http://.../InventoryStatus.asp --post-data \
         'cboSupplier=4541-134289&status=all&action-select=Query' \
         -O InventoryStatus1.asp
    wget http://.../InventoryStatus.asp --post-data \
         'cboSupplier=4542-134289&status=all&action-select=Query' \
         -O InventoryStatus2.asp

It might even work to simply use GET, and retrieve http://.../InventoryStatus.asp?cboSupplier=4541-134289&status=all&action-select=Query without the need for `--post-data' or `-O', but that depends on the ASP script that does the processing.

The harder part is to automate this process for *any* values in the drop-down list. You might need to use an intermediary Perl script that extracts all the <option value=...> tags from the HTML source of the page with the drop-down. Then, from the output of the Perl script, you call Wget as shown above. It's doable, but it takes some work. Unfortunately, I don't know of a (command-line) tool that would make this easier.
Re: some wget patches against beta3
[EMAIL PROTECTED] (Martin v. Löwis) writes: Why do you think the scheme is narrow-minded? Because 1.9-beta3 seems to be a problem.

    VERSION = ('[.0-9]+-?b[0-9]+'
               '|[.0-9]+-?dev[0-9]+'
               '|[.0-9]+-?pre[0-9]+'
               '|[.0-9]+-?rel[0-9]+'
               '|[.0-9]+[a-z]?'
               '|[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')

But that's narrow. Why support 1.9-b3, but not 1.9-beta3 or 1.9-alpha3, or 1.9-rc10? Those and similar version schemes are in wide use. That's really bad. But what's even worse is that something or someone silently changed beta3 to b3 in the POT, and then failed to perform the same change for my translation, which caused it to get dropped without notice. Nothing should get dropped without a notice. [...] I now understand that this could have been an exception due to the outage. But that's how it happened. I sent the translation -- twice -- and it got dropped. Karl told me to resend the translation with a 1.9-b3 version (which I'd never heard of before), so I naturally assumed that the submission had been dropped because of the version. Now, since UMontreal has changed the translation@ alias, it might be that some messages were lost during the outage; this is unfortunate, but difficult to correct, as we cannot find out which messages might have been lost. Fortunately, most translators know to get a message back from the robot for all submissions, so if they don't get one, they resend. Note that I did resend, but to no avail. My first attempt contained a MIME attachment, which I then found out the robot didn't understand. My second attempt was from po-mode, which should have produced a valid message, except for the version.
Re: wget ipv6 patch
Mauro Tortonesi [EMAIL PROTECTED] writes: so, i am asking you: what do you think of these changes? Overall they look very good! Judging from the patch, a large part of the work seems to be in an unexpected place: the FTP code. Here are some remarks I got looking at the patch.

It inadvertently undoes the latest fnmatch move.

I still don't understand the choice to use sockaddr and sockaddr_storage in application code. They result in needless casts and (to me) incomprehensible code. For example, this cast:

    (unsigned char *)(addr->addr_v4.s_addr)

would not be necessary if the address were defined as unsigned char[4].

I don't understand the new PASSIVE flag to lookup_host.

In lookup_host, the comment says that you don't need to call getaddrinfo_with_timeout, but then you call getaddrinfo_with_timeout. An oversight?

You removed this code:

    -  /* ADDR is defined to be in network byte order, which is what
    -     this returns, so we can just copy it to STORE_IP.  However,
    -     on big endian 64-bit architectures the value will be stored
    -     in the *last*, not first four bytes.  OFFSET makes sure that
    -     we copy the correct four bytes.  */
    -  int offset = 0;
    -#ifdef WORDS_BIGENDIAN
    -  offset = sizeof (unsigned long) - sizeof (ip4_address);
    -#endif

But the reason the code is there is that inet_aton is not present on all architectures, whereas inet_addr is. So I used only inet_addr in the IPv4 case, and inet_addr stupidly returned `long', which requires some contortions to copy into a uchar[4] on 64-bit machines. (I see that inet_addr returns `in_addr_t' these days.) If you intend to use inet_aton without checking, there should be a fallback implementation in cmpt.c.

I note that you elided TYPE from ip_address if ENABLE_IPV6 is not defined. That (I think) results in code duplication in some places, because the code effectively has to handle the IPv4 case twice:

    #ifdef ENABLE_IPV6
      switch (addr->type)
        {
        case IPv6:
          ... IPv6 handling ...
          break;
        case IPv4:
          ... IPv4 handling ...
          break;
        }
    #else
      ... IPv4 handling because TYPE is not present without ENABLE_IPV6 ...
    #endif

If it would make your life easier to add TYPE in the !ENABLE_IPV6 case, so you can write it more compactly, by all means do it. By more compactly I mean something like this:

    switch (addr->type)
      {
    #ifdef ENABLE_IPV6
      case IPv6:
        ... IPv6 handling ...
        break;
    #endif
      case IPv4:
        ... IPv4 handling ...
        break;
      }
Re: wget ipv6 patch
Mauro Tortonesi [EMAIL PROTECTED] writes: I still don't understand the choice to use sockaddr and sockaddr_storage in application code. They result in needless casts and (to me) incomprehensible code. well, using sockaddr_storage is the right way (TM) to write IPv6 enabled code ;-) Not when the only thing you need is storing the result of a DNS lookup. I've seen the RFC, but I don't agree with it in the case of Wget. In fact, even the RFC states that the data structure is merely a help for writing portable code across multiple address families and platforms. Wget doesn't aim for AF independence, and the alternatives are at least as good for platform independence.

For example, this cast: (unsigned char *)(addr->addr_v4.s_addr) would not be necessary if the address were defined as unsigned char[4]. in_addr is the correct structure to store ipv4 addresses. using in_addr instead of unsigned char[4] makes it much easier to copy or compare ipv4 addresses. moreover, you don't have to care about the integer size on 64-bit architectures. An IPv4 address is nothing more than a 32-bit quantity. I don't see anything incorrect about using unsigned char[4] for that, and that works perfectly fine on 64-bit architectures. Besides, you seem to be willing to cache the string representation of an IP address. Why is it acceptable to work with a char *, but unacceptable to work with unsigned char[4]? I simply don't see that in_addr is helping anything in host.c's code base.

I don't understand the new PASSIVE flag to lookup_host. well, that's a problem. to get a socket address suitable for bind(2), you must call getaddrinfo with the AI_PASSIVE flag set. Why? The current code seems to get by without it. There must be a way to get at the socket address without calling getaddrinfo.

are there __REALLY__ systems which do not support inet_aton? their ISVs should be ashamed of themselves... Those systems are very old, possibly predating the very invention of inet_aton.

If it would make your life easier to add TYPE in the !ENABLE_IPV6 case, so you can write it more compactly, by all means do it. By more compactly I mean something like this: [...] that's a question i was going to ask you. i supposed you were against adding the type member to ip_address in the IPv4-only case. Maintainability is more important than saving a few bytes per cached IP address, especially since I don't expect the number of cache entries to ever be large enough to make a difference. (If someone downloads from so many addresses that the hash table sizes become a problem, the TYPE member will be the least of his problems.)

P.S. please notice that by caching the string representation of IP addresses instead of their network representation, the code could become much more elegant and simple. You said that before, but I don't quite understand why that's the case. It's certainly not the case for IPv4.
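To make the trade-off concrete, the kind of internal representation being argued for might look something like the sketch below; the field and constant names are made up for illustration and are not the actual Wget declarations. System structs (sockaddr_in, sockaddr_in6) would only be filled in at the point where an address is handed to connect() or bind().

    #include <string.h>

    /* Hypothetical internal representation: a small tag plus raw bytes
       in network order. */
    enum addr_type {
      ADDR_IPV4
    #ifdef ENABLE_IPV6
      , ADDR_IPV6
    #endif
    };

    struct ip_address {
      enum addr_type type;
      union {
        unsigned char v4[4];      /* 32-bit IPv4 address, network order */
    #ifdef ENABLE_IPV6
        unsigned char v6[16];     /* 128-bit IPv6 address, network order */
    #endif
      } u;
    };

    /* Compare two addresses; memcmp is all that is needed. */
    static int
    address_equal (const struct ip_address *a, const struct ip_address *b)
    {
      if (a->type != b->type)
        return 0;
    #ifdef ENABLE_IPV6
      if (a->type == ADDR_IPV6)
        return memcmp (a->u.v6, b->u.v6, 16) == 0;
    #endif
      return memcmp (a->u.v4, b->u.v4, 4) == 0;
    }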
Re: some wget patches against beta3
[EMAIL PROTECTED] (Martin v. Löwis) writes: Hrvoje Niksic [EMAIL PROTECTED] writes: VERSION = ('[.0-9]+-?b[0-9]+' '|[.0-9]+-?dev[0-9]+' '|[.0-9]+-?pre[0-9]+' '|[.0-9]+-?rel[0-9]+' '|[.0-9]+[a-z]?' '|[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]') But that's narrow. Why support 1.9-b3, but not 1.9-beta3 or 1.9-alpha3, or 1.9-rc10? Those and similar version schemes are in wide use. Are you requesting the addition of these three formats? Yes, please. To be clear: it would be ideal if the Robot didn't care about versioning at all. But if it really has to, then it should support versioning schemes in wide use.
Re: windows patch for cvs
Thanks for the patch, Herold. I've applied it and also added similar fixes for Borland's and Watcom's Makefiles. I've used the following ChangeLog entry:

2003-10-09  Herold Heiko  [EMAIL PROTECTED]

	* windows/Makefile.src: Removed references to fnmatch.c and
	fnmatch.o.
	* windows/Makefile.watcom (OBJS): Ditto.
	* windows/Makefile.src.bor: Ditto.
	* windows/wget.dep: Ditto.
Re: wget checks timestamp on wrong file
It's a bug. -O currently doesn't work everywhere it should. If you just want to change the directory where Wget operates, the workaround is to use `-P'. E.g.: wget -N ftp://ftp.pld-linux.org/dists/ac/PLD/athlon/PLD/RPMS/packages.dir.mdd -P /root/tmp/ftp_ftp.pld-linux.org.dists.ac.PLD.athlon.PLD.RPMS
Re: wget ipv6 patch
Mauro Tortonesi [EMAIL PROTECTED] writes: and i'm saying that for this task the ideal structure is sockaddr_storage. notice that my code uses sockaddr_storage (typedef'd as wget_sockaddr) only when dealing with socket addresses, not for ip address caching. Now I see. Thanks for clearing it up.

An IPv4 address is nothing more than a 32-bit quantity. I don't see anything incorrect about using unsigned char[4] for that, and that works perfectly fine on 64-bit architectures. ok, but in this way you have to call memcmp each time you want to compare two ip addresses and memcpy each time you want to copy an ip address. Well, you can copy addresses with the assignment operator as well, as long as they're in a `struct', as they are in the current code. You do need `memcmp' to compare them, but that's fine with me. i prefer the in_addr approach (and i don't understand why we shouldn't use structures like in_addr and in_addr6 which have been created just to do what we want: storing ip addresses) Because they're complexly defined and hard to read if all you want is to store 4 and 16 bytes of binary data, respectively. however, notice that using unsigned char[4] and unsigned char[16] is a less portable solution and is potentially prone to problems with the alignment of the sockaddr_in and sockaddr_in6 structs. Note that I only propose using unsigned char[N] for internal storing of addresses, such as in Wget's own `struct ip_address'. For talking to system API's, we can and should copy the address to the appropriate sockaddr_* structure. That's how the current code works, and it's quite portable.

Besides, you seem to be willing to cache the string representation of an IP address. Why is it acceptable to work with a char *, but unacceptable to work with unsigned char[4]? I simply don't see that in_addr is helping anything in host.c's code base. i would prefer to cache string representation of ip addresses because the ipv6 code would be much simpler and more elegant. I agree. My point was merely to point out that even you yourself believe that struct in_addr* is not the only legitimate way to store an IP address.

I don't understand the new PASSIVE flag to lookup_host. well, that's a problem. to get a socket address suitable for bind(2), you must call getaddrinfo with the AI_PASSIVE flag set. Why? The current code seems to get by without it. the problem is when you call lookup_host to get a struct to pass to bind(2). if you use --bind-address=localhost and you don't set the AI_PASSIVE flag, getaddrinfo will return the 127.0.0.1 address, which is incorrect. There must be a way to get at the socket address without calling getaddrinfo. not if you want to use --bind-address=ipv6only.domain.com. I see. I guess we'll have to live with it, one way or the other. Instead of accumulating boolean arguments, lookup_host should probably accept a FLAGS argument, so you can call it with, e.g.: lst = lookup_host (addr, LH_PASSIVE | LH_SILENT); ...
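A minimal sketch of what that flags-based interface could look like -- the LH_* names follow the example in the mail, everything else (including the address_list type used here) is assumed for illustration:

    /* Hypothetical flag values for a lookup_host that takes a FLAGS
       argument instead of several booleans. */
    #define LH_SILENT   0x01   /* don't print "Resolving host..." messages */
    #define LH_PASSIVE  0x02   /* resolve for bind() rather than connect() */

    struct address_list *lookup_host (const char *host, int flags);

    /* Example call, as in the mail:
         lst = lookup_host (addr, LH_PASSIVE | LH_SILENT);  */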