Re: autoconf 2.5x and automake support for wget 1.9 beta

2003-09-07 Thread Hrvoje Niksic
Maciej W. Rozycki [EMAIL PROTECTED] writes:

 I couldn't send the patches earlier, sorry.  Besides what you have
 already done, I have the following bits within my changes.

Thanks, I never would have caught those myself.  Do you have
suggestions for Autoconf 2.5x features Wget could put to good use?


Re: --disable-dns-cache patch

2003-09-07 Thread Hrvoje Niksic
Jeremy Reeve [EMAIL PROTECTED] writes:

 Please consider this, my trivial --disable-dns-cache patch for wget.

 ChangeLog should read something like:

 2003-09-07  Jeremy S. Reeve  [EMAIL PROTECTED]
   *   host.c, init.c, main.c, options.h:  Added --disable-dns-cache
       option to turn off caching of hostname lookups.

Thanks for the patch.  I'm curious, in what circumstances would one
want to use this option?  (I'm also asking because of the manual in
which I'd like to explain why the option is useful.)

Do you agree with dropping the `disable' from the option name and
changing the option to `--dns-cache=[on,off]', with the default being on?
That way someone who doesn't ever want caching can put `dns_cache =
off' in ~/.wgetrc and still override it with `--dns-cache=on' on the
command line.

The "disable* = on" style reminds me too much of the old "do you want to
delete all your files (yes means no, no means yes) [yes/no]?".  :-)


Re: Retry even when Connection Refused

2003-09-07 Thread Hrvoje Niksic
Ahmon Dancy [EMAIL PROTECTED] writes:

 I'll apply it shortly.

 Thanks.

Applied now.

 Is there a wget-announce mailing list?

No.


Re: Content-Disposition Take 3

2003-09-08 Thread Hrvoje Niksic
Newman, David [EMAIL PROTECTED] writes:

 This is my third attempt at a Content-Disposition patch and if it
 isn't acceptable yet, I'm sure it is pretty close.

Thanks.  Note that I and other (co-)maintainers have been away for
some time, so if your previous attempts have been ignored, it might not
have been for lack of quality in your contribution.

 This patch adds the ability for wget to process the
 Content-Disposition header.  By default wget will ignore the header.
 However, when used with the --content-disposition option wget will
 attempt to find a filename stored within the header and use it to
 store the content.

 For example, given the URL
 http://www.maraudingpirates.org/test.php

 wget will normally set the local filename to test.php

 However, with the --content-disposition option wget will
 instead process the header

 Content-Disposition: attachment; filename=joemama.txt

 and change the local filename to joemama.txt

The thing that worries me about this patch is that in some places
Wget's actions depend on transforming the URL to the output file
name.  I have in mind options like `-c' and `-nc'.  Won't your
patch break those?


Re: Retry even when Connection Refused

2003-09-08 Thread Hrvoje Niksic
Ahmon Dancy [EMAIL PROTECTED] writes:

  Is there a wget-announce mailing list?
 
 No.

 Alright.  Is there a rough estimate for the next release date?

I'm thinking of releasing 1.9 with the accumulated features in the
current CVS.  The code base is IMHO stable enough for that.  The only
major issue holding back the release is that configure.in doesn't
detect IPv6.


Re: IPv6 detection in configure

2003-09-09 Thread Hrvoje Niksic
Daniel Stenberg [EMAIL PROTECTED] writes:

 These are two snippets that can be used to detect IPv6 support and a
 working getaddrinfo(). Adjust as you see fit!

Thanks a bunch!  I'll try it out later today.


Re: upcoming new wget version

2003-09-09 Thread Hrvoje Niksic
Jochen Roderburg [EMAIL PROTECTED] writes:

 Question: Has the often discussed *feature* of version 1.8.x, whereby
 special characters in local filenames are url-encoded, been fixed in
 the meantime?

Hmm, that was another thing scheduled to be fixed for 1.9.


Re: Windows filename patch

2003-09-09 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 could you please check the thread Windows filename patch for 1.8.2
 from around 24-05-2002 (Hack Kampbjørn, Ian Abbott) ?  That patch
 (url.c) got committed to the 1.8 branch but not to the 1.9 branch.
 Also, it is comprised of two parts, the first one:

Part of the reason it wasn't applied was that I wanted to fix the
problem properly for 1.9.  I guess I could apply your patch now and
remove it if/when the proper fix is in place.


Re: rfc2732 patch for wget

2003-09-09 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 On Mon, 8 Sep 2003, Post, Mark K wrote:

 Absolutely.  I would much rather get an intelligent error message
 stating that ipv6 addresses are not supported, versus a misleading
 one about the host not being found.  That would save end-users a
 whole lot of wasted time.

 i agree here.

OK then.  Here is an additional patch:

2003-09-09  Hrvoje Niksic  [EMAIL PROTECTED]

* url.c (url_parse): Return an error if the URL contains a [...]
IPv6 numeric address and we don't support IPv6.

Index: src/url.c
===
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.77
diff -u -r1.77 url.c
--- src/url.c   2003/09/05 20:36:17 1.77
+++ src/url.c   2003/09/09 13:02:46
@@ -649,7 +649,9 @@
   "Invalid user name",
 #define PE_UNTERMINATED_IPV6_ADDRESS   5
   "Unterminated IPv6 numeric address",
-#define PE_INVALID_IPV6_ADDRESS        6
+#define PE_IPV6_NOT_SUPPORTED          6
+  "IPv6 addresses not supported",
+#define PE_INVALID_IPV6_ADDRESS        7
   "Invalid IPv6 numeric address"
 };
 
@@ -658,6 +660,7 @@
 *(p) = (v);\
 } while (0)
 
+#ifdef INET6
 /* The following two functions were adapted from glibc. */
 
 static int
@@ -787,8 +790,8 @@
 
   return 1;
 }
+#endif
 
-
 /* Parse a URL.
 
Return a new struct url if successful, NULL on error.  In case of
@@ -860,6 +863,7 @@
  return NULL;
}
 
+#ifdef INET6
   /* Check if the IPv6 address is valid. */
   if (!is_valid_ipv6_address(host_b, host_e))
{
@@ -869,6 +873,10 @@
 
   /* Continue parsing after the closing ']'. */
   p = host_e + 1;
+#else
+  SETERR (error, PE_IPV6_NOT_SUPPORTED);
+  return NULL;
+#endif
 }
   else
 {


IPv6 detection in configure

2003-09-09 Thread Hrvoje Niksic
Thanks to Daniel Stenberg who has either been reading my mind or has
had the exact same needs, here is a patch that brings configure
(auto-)detection for IPv6.

Please test it out on various configurations where IPv6 is or is not
enabled.

ChangeLog:
2003-09-09  Hrvoje Niksic  [EMAIL PROTECTED]

* configure.in, aclocal.m4: Added configure check for IPv6 and
getaddrinfo.  From Daniel Stenberg.

src/ChangeLog:
2003-09-09  Hrvoje Niksic  [EMAIL PROTECTED]

* config.h.in: Initialize HAVE_GETADDRINFO and ENABLE_IPV6.

* all: Use #ifdef ENABLE_IPV6 instead of the older INET6.  Use
HAVE_GETADDRINFO for getaddrinfo-related stuff.

Index: aclocal.m4
===
RCS file: /pack/anoncvs/wget/aclocal.m4,v
retrieving revision 1.6
diff -u -r1.6 aclocal.m4
--- aclocal.m4  2003/09/04 21:29:08 1.6
+++ aclocal.m4  2003/09/09 19:25:07
@@ -86,6 +86,47 @@
   AC_MSG_RESULT(no)
 fi])
 
+dnl 
+dnl check for working getaddrinfo()
+dnl
+AC_DEFUN(WGET_CHECK_WORKING_GETADDRINFO,[
+  AC_CACHE_CHECK(for working getaddrinfo, ac_cv_working_getaddrinfo,[
+  AC_TRY_RUN( [
+#include <netdb.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+
+int main(void) {
+struct addrinfo hints, *ai;
+int error;
+
+memset(&hints, 0, sizeof(hints));
+hints.ai_family = AF_UNSPEC;
+hints.ai_socktype = SOCK_STREAM;
+error = getaddrinfo("127.0.0.1", "8080", &hints, &ai);
+if (error) {
+exit(1);
+}
+else {
+exit(0);
+}
+}
+],[
+  ac_cv_working_getaddrinfo=yes
+],[
+  ac_cv_working_getaddrinfo=no
+],[
+  ac_cv_working_getaddrinfo=yes
+])])
+if test x$ac_cv_working_getaddrinfo = xyes; then
+  AC_DEFINE(HAVE_GETADDRINFO, 1, [Define if getaddrinfo exists and works])
+  AC_DEFINE(ENABLE_IPV6, 1, [Define if you want to enable IPv6 support])
+
+  IPV6_ENABLED=1
+  AC_SUBST(IPV6_ENABLED)
+fi
+])
+
 
 # This code originates from Ulrich Drepper's AM_WITH_NLS.
 
Index: configure.in
===
RCS file: /pack/anoncvs/wget/configure.in,v
retrieving revision 1.36
diff -u -r1.36 configure.in
--- configure.in2003/09/05 19:33:44 1.36
+++ configure.in2003/09/09 19:25:09
@@ -30,7 +30,7 @@
 dnl
 
 AC_INIT(src/version.c)
-AC_PREREQ(2.12)
+AC_PREREQ(2.50)
 AC_CONFIG_HEADER(src/config.h)
 
 dnl
@@ -155,7 +155,6 @@
 AC_C_INLINE
 AC_TYPE_SIZE_T
 AC_TYPE_PID_T
-dnl  This generates a warning.  What do I do to shut it up?
 AC_C_BIGENDIAN
 
 # Check size of long.
@@ -441,6 +440,55 @@
 fi
 AC_DEFINE(HAVE_MD5)
 AC_SUBST(MD5_OBJ)
+
+dnl **
+dnl Checks for IPv6
+dnl **
+
+dnl
+dnl If --enable-ipv6 is specified, we try to use IPv6 (as long as
+dnl getaddrinfo is also present).  If --disable-ipv6 is specified, we
+dnl don't use IPv6 or getaddrinfo.  If neither are specified, we test
+dnl whether it's possible to create an AF_INET6 socket and if yes, use
+dnl IPv6.
+dnl
+
+AC_MSG_CHECKING([whether to enable ipv6])
+AC_ARG_ENABLE(ipv6,
+AC_HELP_STRING([--enable-ipv6],[Enable ipv6 support])
+AC_HELP_STRING([--disable-ipv6],[Disable ipv6 support]),
+[ case $enableval in
+  no)
+   AC_MSG_RESULT(no)
+   ipv6=no
+   ;;
+  *)   AC_MSG_RESULT(yes)
+   ipv6=yes
+   ;;
+  esac ],
+
+  AC_TRY_RUN([ /* is AF_INET6 available? */
+#include <sys/types.h>
+#include <sys/socket.h>
+main()
+{
+ if (socket(AF_INET6, SOCK_STREAM, 0) < 0)
+   exit(1);
+ else
+   exit(0);
+}
+],
+  AC_MSG_RESULT(yes)
+  ipv6=yes,
+  AC_MSG_RESULT(no)
+  ipv6=no,
+  AC_MSG_RESULT(no)
+  ipv6=no
+))
+
+if test x$ipv6 = xyes; then
+  WGET_CHECK_WORKING_GETADDRINFO
+fi
 
 dnl
 dnl Set of available languages.
Index: src/config.h.in
===
RCS file: /pack/anoncvs/wget/src/config.h.in,v
retrieving revision 1.24
diff -u -r1.24 config.h.in
--- src/config.h.in 2002/05/18 02:16:19 1.24
+++ src/config.h.in 2003/09/09 19:25:32
@@ -250,6 +250,12 @@
 /* Define if we're using builtin (GNU) md5.c.  */
 #undef HAVE_BUILTIN_MD5
 
+/* Define if you have the getaddrinfo function.  */
+#undef HAVE_GETADDRINFO
+
+/* Define if you want to enable the IPv6 support.  */
+#undef ENABLE_IPV6
+
 /* First a gambit to see whether we're on Solaris.  We'll
need it below.  */
 #ifdef __sun
Index: src/connect.c
===
RCS file: /pack/anoncvs/wget/src/connect.c,v
retrieving revision 1.18
diff -u -r1.18 connect.c
--- src/connect.c   2002/05/18 02:16:19 1.18
+++ src/connect.c   2003/09/09 19:25:33
@@ -412,7 +412,7 @@
 
   switch (mysrv.sa.sa_family)
 {
-#ifdef INET6
+#ifdef ENABLE_IPV6
 case AF_INET6:
   memcpy (ip, &mysrv.sin6.sin6_addr, 16);
   return 1;
Index: src/ftp-basic.c

Re: --disable-dns-cache patch

2003-09-10 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 Thanks for the patch.  I'm curious, in what circumstances would one
 want to use this option?  (I'm also asking because of the manual in
 which I'd like to explain why the option is useful.)

 e.g., with RFC 3041 temporary ipv6 addresses.

Do they really change within a Wget run?  Remember that Wget's cache
is not written anywhere on disk.


Re: autoconf 2.5 patch for wget

2003-09-10 Thread Hrvoje Niksic
[ I'm Cc-ing the list because this might be interesting to others. ]

Mauro Tortonesi [EMAIL PROTECTED] writes:

 ok, i agree here. but, in order to help me with my work on wget, could
 you please tell me:

  * how do you generate a wget tarball for a new release

With the script `dist-wget' in the util directory.  Ideally the `make
dist' target should do the same job, but it gets some things wrong.
Take a look at what `dist-wget' does, AFAIR it's pretty clearly
written.

  * how do you generate/maintain gettext-related files (e.g. the files in
the po directory)

The `.po' files are from the translation project.  POTFILES.IN is
generated by hand when a new `.c' file is added.

  * how do you generate/maintain libtool-related files
(e.g. ltmain.sh)

When a new libtool release comes out, ltmain.sh is replaced with the
new one and aclocal.m4 is updated with the latest libtool.m4.
config.sub and config.guess are updated as needed.

  * how do you generate/maintain automake-related files
(e.g. aclocal.m4, mkinstalldirs, install-sh, etc...)

I don't use Automake.  mkinstalldirs and install-sh are standard
Autoconf stuff that probably hasn't changed for years.  If a bug is
discovered, you can get the latest version from the latest Autoconf or
wherever.

 it would be impossible for me to keep working on the
 autoconf-related part of wget without these info.

I hope the above helped.  There's really not much to it.

 BTW: could you please tell me what of these changes are acceptable
 for you:

 * Re-organized all wget-specific autoconf macros in the config
   directory

As long as you're very careful not to break things, I'm fine with
that.  But be careful: take into account that Wget doesn't ship with
libintl, that it doesn't use Automake, etc.  When in doubt, ask.  If
possible, start with small things.

 * Re-libtoolized and re-gettextized the package

I believe that libtoolization and gettextization are tied with
Automake, but I could be wrong.  I'm pretty sure that the
gettextization process was wrong for Wget.

 * Updated aclocal.m4, config.guess, config.sub

Note that Wget doesn't use a pre-generated (or auto-generated)
aclocal.m4.  Updating config.guess and config.sub is, of course, fine.

 * Added IPv6 stack detection to the configuration process

Please be careful: Wget doesn't need the kind of stack detection that
I've seen in many programs patched to support IPv6.  Specifically, I
don't want to cater to old buggy or obsolete IPv6 stacks.

That's what I liked about Daniel's patch: it was straightforward and
seemed to do the trick.  If at all possible, go along those lines.

 * Re-named configure.in to configure.ac and modified the
   file for better autoconf 2.5x compliance

That's fine, as long as it's uncoupled from other changes.
Specifically, it should be possible to test all Autoconf-related
changes.

 * Added profiling support to the configure script

I'm not sure what you mean here.  Why does configure need to be aware
of profilers?

 * Re-named the realclean target to maintainer-clean in the
   Makefiles for better integration with po/Makefile.in.in and
   conformance to the de-facto standards

That should be fine.

 * Modified the invocation of config.status in the targets in the
   Dependencies for maintenance section of Makefile.in, according
   to the new syntax introduced by autoconf 2.5x

I haven't studied the new Autoconf in detail, but I trust that you
know what you're doing here.

 util/Makefile.in: added rmold.pl target, just like texi2pod.pl
 in doc/Makefile.in

 src/wget.h: added better handling of HAVE_ALLOCA_H and changed
 USE_NLS to ENABLE_NLS

Sounds fine.  BTW what do you mean by better handling of
HAVE_ALLOCA_H?  Do you actually know that Wget's code was broken on
some platforms, or are you just replacing old Autoconf boilerplate
code with new one?

Thanks for the work you've put in.


Re: autoconf 2.5 patch for wget

2003-09-10 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

   * how do you generate/maintain gettext-related files (e.g. the files in
 the po directory)

 The `.po' files are from the translation project.  POTFILES.IN is
 generated by hand when a new `.c' file is added.

 ok, but what about Makefile.in.in and wget.pot?

AFAIR wget.pot is generated by Makefile.  (It should probably not be
in CVS, though.)  Makefile.in.in is not generated, it was originally
adapted from the original Makefile.in.in from the gettext
distribution.  It has served well for years in the current form.

   * how do you generate/maintain libtool-related files
 (e.g. ltmain.sh)

 When a new libtool release comes out, ltmain.sh is replaced with
 the new one and aclocal.m4 is updated with the latest libtool.m4.
 config.sub and config.guess are updated as needed.

 do you mean that you simply copy these files manually from other
 packages?

Yes.  I don't do that very often.

 how do you update aclocal.m4?

Wget's aclocal.m4 only contains Wget-specific stuff so it doesn't need
special updating.  The single exception is, of course, the
`libtool.m4' part which needs to be updated along with ltmain.sh, but
that is also rare.  I really think aclocal.m4 should simply be
INCLUDEing libtool.m4, but I wasn't sure how to do that, so I left it
at that.  (Note that I wasn't the one who introduced libtool to Wget,
so it wasn't up to me originally.)

 please, notice that i am __NOT__ criticizing this.

Don't worry, I'm not reading malice in your questions.  All your
questions are in fact quite valid and responding to them serves to
remind myself of why I made the choices I did.

 I don't use Automake.  mkinstalldirs and install-sh are standard
 Autoconf stuff

 true.


 that probably hasn't changed for years.

 i am not so sure about this.

If they've changed and if updating them won't break anything, feel
free to update them.  (In a separate patch, if possible. :-))

  * Updated aclocal.m4, config.guess, config.sub

 Note that Wget doesn't use a pre-generated (or auto-generated)
 aclocal.m4.  Updating config.guess and config.sub is, of course, fine.

 how do you maintain aclocal.m4, then? by hand? this seems a bit too
 manual for me :-)

I believe Wget's aclocal.m4 is quite different from the ones in
Automake-influenced software.  I could be wrong, though.  Please take
another look at it, and please do ignore the libtool stuff which
should really be handled with an include.

 and, more important, with this approach your package may keep using
 broken/obsoleted autoconf macros without your knowledge.

I'm not so sure about that.  The way I see it, Wget's configure.in and
aclocal.m4 use documented Autoconf macros.  Unless Autoconf changes
incompatibly (which it shouldn't do without changing the major
version), they should keep working.

  * Added IPv6 stack detection to the configuration process

 Please be careful: Wget doesn't need the kind of stack detection that
 I've seen in many programs patched to support IPv6.

 i am afraid you're wrong here. usagi or kame stack detection is
 necessary to link the binary to libinet6 (if present). this lets
 wget use a version of getaddrinfo which is RFC3493-compliant and
 supports the AI_ALL, AI_ADDRCONFIG (which is __VERY__ important) and
 AI_V4MAPPED flags.  the implementation of getaddrinfo shipped with
 glibc is not RFC3493-compliant.

Shouldn't we simply check for libinet6 in the usual fashion?

Furthermore, I don't think that Wget uses any of those flags.  Why
should an application that doesn't use them care?  Note that I ask
this not to annoy you but to learn; you obviously know much more about
IPv6 than I do.

I have to go now; I'll answer the rest of your message separately.
Thanks for your patience and for the detailed reply.


Re: autoconf 2.5 patch for wget

2003-09-10 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 AFAIR wget.pot is generated by Makefile.  (It should probably not be
 in CVS, though.)  Makefile.in.in is not generated, it was originally
 adapted from the original Makefile.in.in from the gettext
 distribution.  It has served well for years in the current form.

 ok. i'll see if the new Makefile.in.in which ships with the latest
 gettext is worth an upgrade.

Note that Wget's Makefile.in.in is likely quite different than the
canonical version because of the lack of libintl bundling.  That's as
it should be.

  how do you update aclocal.m4?

 Wget's aclocal.m4 only contains Wget-specific stuff so it doesn't need
 special updating.  The single exception is, of course, the
 `libtool.m4' part which needs to be updated along with ltmain.sh, but
 that is also rare.  I really think aclocal.m4 should simply be
 INCLUDEing libtool.m4, but I wasn't sure how to do that, so I left it
 at that.  (Note that I wasn't the one who introduced libtool to Wget,
 so it wasn't up to me originally.)

 ok, so you simply take libtool.m4 or maybe only a part of it, and add all
 wget-specific macros to it.

Or the other way around: leave Wget-specific macros and replace
libtool.m4 contents.  aclocal.m4 has this part:

# We embed libtool.m4 from libtool distribution.

# -- embedded libtool.m4 begins here --

[ ... contents of libtool.m4 follows ... ]

# -- embedded libtool.m4 ends here --

When you need to update libtool.m4, you do the obvious -- replace the
old contents of libtool.m4 with the new contents.

As I said, it would be even better if it said something like
AC_INCLUDE([libtool.m4]) (or whatever the correct syntax is), so you
can simply drop in the new libtool.m4 without the need for editing.

 Shouldn't we simply check for libinet6 in the usual fashion?

 this could be another solution. but i think it would be much better
 to do it only for kame and usagi stack.

Hmm.  Checking for stacks by names is not the Autoconf way.  Isn't
it better to test for needed features?  Daniel's test was written in
that spirit.

 Furthermore, I don't think that Wget uses any of those flags.  Why
 should an application that doesn't use them care?  Note that I ask
 this not to annoy you but to learn; you obviously know much more about
 IPv6 than I do.

 well, it is very important to use AI_ADDRCONFIG with getaddrinfo. in this
 way you get resolution of AAAA records only if you have ipv6 working on
 your host (and, less important, resolution of A records only if you
 have ipv4 working on your host). dns resolution in a mixed ipv4 and ipv6
 environment is a nightmare and AI_ADDRCONFIG can save you a lot of
 headaches.

Very interesting.  So what you're saying is that programs that simply
follow the getaddrinfo man page (including IPv6-enabled Wget in
Debian) don't work in mixed environments?  That's really strange.



Re: using host-cache configurable via command line

2003-09-10 Thread Hrvoje Niksic
Patrick Cernko [EMAIL PROTECTED] writes:

 I discovered a small problem with the increasing number of servers with
 changing IPs but constant names (provided by nameservers like
 dyndns.org).  If the download with wget is interrupted by an IP change
 (e.g. a dialup host whose provider killed the connection), wget retries
 the download using the previously cached IP.  This will fail, as the host
 (specified by its dyndns hostname) is no longer reachable via this old
 IP.  Instead it is reachable over a new IP (assigned by its provider).
 But it is still reachable via its hostname, as the host updated the DNS
 entry with its new IP.
 
 So I patched wget to tell it not to use the cached IPs from earlier, but
 instead to do a new host lookup as if it were connecting to the host for
 the first time.

Patrick, thanks for the patch and the explanation.  A similar change,
probably with invocation `--dns-cache=off', is scheduled to appear in
the next release.

Your contribution is also important because we've been looking for a
suitable text for the manual that explains why it is sometimes
beneficial to turn off the DNS cache.



Re: autoconf 2.5 patch for wget

2003-09-11 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 On Wed, 10 Sep 2003, Hrvoje Niksic wrote:

 Mauro Tortonesi [EMAIL PROTECTED] writes:

  Shouldn't we simply check for libinet6 in the usual fashion?
 
  this could be another solution. but i think it would be much better
  to do it only for kame and usagi stack.

 Hmm.  Checking for stacks by names is not the Autoconf way.  Isn't
 it better to test for needed features?  Daniel's test was written in
 that spirit.

 i think kame or usagi stack detection is not so ugly and works
 better than the simple detection of libinet6. in fact, if you don't
 want to perform stack detection, you have to test if libinet6 is
 installed on the host system __and__ if the getaddrinfo function
 contained in libinet6 is better than the one shipped with the
 libc.
 it is a cleaner (and better) approach, but much more
 complicated and error prone than stack detection, IMVHO.

Isn't the second check a matter of running a small test program, as in
the check that Daniel provided (but more sophisticated)?

If we absolutely must detect kame and usagi (whatever those
are :-)), we'll do so.  But I'd like to be sure that other options
have been researched.

  Furthermore, I don't think that Wget uses any of those flags.  Why
  should an application that doesn't use them care?  Note that I ask
  this not to annoy you but to learn; you obviously know much more about
  IPv6 than I do.
 
  well, it is very important to use AI_ADDRCONFIG with getaddrinfo. in this
  way you get resolution of AAAA records only if you have ipv6 working on
  your host (and, less important, resolution of A records only if you
  have ipv4 working on your host). dns resolution in a mixed ipv4 and ipv6
  environment is a nightmare and AI_ADDRCONFIG can save you a lot of
  headaches.

 Very interesting.  So what you're saying is that programs that simply
 follow the getaddrinfo man page (including IPv6-enabled Wget in
 Debian) don't work in mixed environments?  That's really strange.

 no, i'm not saying that. i'm saying that if you have a program that calls
 getaddrinfo on an ipv6(4)-only host you also get A (AAAA) records with
 ipv4(6) addresses that you cannot connect to. this may slow down the
 connection process (if the code is well written), or simply break it
 (if the code is badly written) and may also cause other subtle
 problems.

Then it sounds like we definitely want to use this flag.  However, you go
on to say:

 by using the AI_ADDRCONFIG flag with getaddrinfo on an ipv6(4)-only
 host you get only AAAA (A) records. however, i think that a "use ipvX
 only" configuration option is a better solution than AI_ADDRCONFIG.

Better solution in the sense that we shouldn't use AI_ADDRCONFIG
after all?  Or that this configuration option should be an alternative
to AI_ADDRCONFIG?

If the latter is the case, should there be a "use ipvX only" runtime option
as well?


Re: autoconf 2.5 patch for wget

2003-09-11 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 Isn't the second check a matter of running a small test program, as in
 the check that Daniel provided (but more sophisticated)?

 sure. but what was the problem with stack detection? it's simply a couple
 of AC_EGREP_CPP macros after all...

The problem I have with IN6_GUESS_STACK is that it seems to rely on
product information, in this case the known stack names.  And those
things change.  So when usagi gets renamed to yojimbo or when we
port Wget to a new IPv6-aware architecture, or when a new IPv6
implementation gets added to an existing architecture, we need to
update our Autoconf macros.  Updating the macros sucks, not only
because M4 blows chunks, but also because it means that older source
releases of Wget will no longer work.

One of the design goals of Autoconf was to avoid the fallacy of older
tools that had complex product databases that had to be maintained by
hand.  Instead, most Autoconf tests try to check for features.  The
exception are cases when such checks are not possible or feasible.
This might or might not be the case here.  So if it really takes too
long or it's just too hard to write a check, then we'll use a version
of IN6_GUESS_STACK.

 i could start from:

 http://cvs.deepspace6.net/view/nc6/config/in6_guess_stack.m4?rev=HEAD&content-type=text/vnd.viewcvs-markup

 and make it much simpler (15-30 lines). what is your opinion about
 it?

Simplifying that code, *and* adding a fallback that handles unknown
stacks in a reasonable fashion (for example by assuming minimal
functionality or strict standard compliance) sounds fine to me.  I'd
still prefer a purely feature based check, but again, if you tell me
it's hard or impossible to write one, I'll believe you.

  If the latter is the case, should there be a "use ipvX only" runtime
  option as well?

 i think that -4 and -6 command line options for wget are a MUST. the
 first would make wget use ipv4 only, while the second would make
 wget use ipv6 only. believe me, there are plenty of cases in which
 you want to use such options.

I agree that those options are useful.  And since Wget doesn't
currently use numeric-only options, those are available.
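
To make that concrete, such options would presumably boil down to the
address family passed to the resolver.  A rough sketch, where the
ip_version value is an assumed setting rather than an existing Wget
variable:

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>

/* Sketch: map hypothetical -4/-6 options onto the hints passed to
   getaddrinfo().  IP_VERSION would be 4, 6, or 0 for "no preference";
   it is an assumed setting, not an existing Wget variable.  */
static void
init_resolver_hints (struct addrinfo *hints, int ip_version)
{
  memset (hints, 0, sizeof (*hints));
  hints->ai_socktype = SOCK_STREAM;
  if (ip_version == 4)
    hints->ai_family = AF_INET;        /* -4: A records only */
  else if (ip_version == 6)
    hints->ai_family = AF_INET6;       /* -6: AAAA records only */
  else
    hints->ai_family = AF_UNSPEC;      /* default: either */
}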


Re: bug in wget - wget break on time msec=0

2003-09-13 Thread Hrvoje Niksic
Boehn, Gunnar von [EMAIL PROTECTED] writes:

 I think I found a bug in wget.

You did.  But I believe your subject line is slightly incorrect.  Wget
handles 0 length time intervals (see the assert message), but what it
doesn't handle are negative amounts.  And indeed:

 gettimeofday({1063461157, 858103}, NULL) = 0
 gettimeofday({1063461157, 858783}, NULL) = 0
 gettimeofday({1063461157, 880833}, NULL) = 0
 gettimeofday({1063461157, 874729}, NULL) = 0

As you can see, the last gettimeofday returned time *preceding* the
one before it.  Your ntp daemon must have chosen that precise moment
to set back the system clock by ~6 milliseconds, to which Wget reacted
badly.

Even so, Wget shouldn't crash.  The correct fix is to disallow the
timer code from ever returning decreasing or negative time intervals.
Please let me know if this patch fixes the problem:


2003-09-14  Hrvoje Niksic  [EMAIL PROTECTED]

* utils.c (wtimer_sys_set): Extracted the code that sets the
current time here.
(wtimer_reset): Call it.
(wtimer_sys_diff): Extracted the code that calculates the
difference between two system times here.
(wtimer_elapsed): Call it.
(wtimer_elapsed): Don't return a value smaller than the previous
one, which could previously happen when system time is set back.
Instead, reset start time to current time and note the elapsed
offset for future calculations.  The returned times are now
guaranteed to be monotonically nondecreasing.

Index: src/utils.c
===
RCS file: /pack/anoncvs/wget/src/utils.c,v
retrieving revision 1.51
diff -u -r1.51 utils.c
--- src/utils.c 2002/05/18 02:16:25 1.51
+++ src/utils.c 2003/09/13 23:09:13
@@ -1532,19 +1532,30 @@
 # endif
 #endif /* not WINDOWS */
 
-struct wget_timer {
 #ifdef TIMER_GETTIMEOFDAY
-  long secs;
-  long usecs;
+typedef struct timeval wget_sys_time;
 #endif
 
 #ifdef TIMER_TIME
-  time_t secs;
+typedef time_t wget_sys_time;
 #endif
 
 #ifdef TIMER_WINDOWS
-  ULARGE_INTEGER wintime;
+typedef ULARGE_INTEGER wget_sys_time;
 #endif
+
+struct wget_timer {
+  /* The starting point in time which, subtracted from the current
+ time, yields elapsed time. */
+  wget_sys_time start;
+
+  /* The most recent elapsed time, calculated by wtimer_elapsed().
+ Measured in milliseconds.  */
+  long elapsed_last;
+
+  /* Approximately, the time elapsed between the true start of the
+ measurement and the time represented by START.  */
+  long elapsed_pre_start;
 };
 
 /* Allocate a timer.  It is not legal to do anything with a freshly
@@ -1577,22 +1588,17 @@
   xfree (wt);
 }
 
-/* Reset timer WT.  This establishes the starting point from which
-   wtimer_elapsed() will return the number of elapsed
-   milliseconds.  It is allowed to reset a previously used timer.  */
+/* Store system time to WST.  */
 
-void
-wtimer_reset (struct wget_timer *wt)
+static void
+wtimer_sys_set (wget_sys_time *wst)
 {
 #ifdef TIMER_GETTIMEOFDAY
-  struct timeval t;
-  gettimeofday (&t, NULL);
-  wt->secs  = t.tv_sec;
-  wt->usecs = t.tv_usec;
+  gettimeofday (wst, NULL);
 #endif
 
 #ifdef TIMER_TIME
-  wt->secs = time (NULL);
+  time (wst);
 #endif
 
 #ifdef TIMER_WINDOWS
@@ -1600,39 +1606,76 @@
   SYSTEMTIME st;
   GetSystemTime (&st);
   SystemTimeToFileTime (&st, &ft);
-  wt->wintime.HighPart = ft.dwHighDateTime;
-  wt->wintime.LowPart  = ft.dwLowDateTime;
+  wst->HighPart = ft.dwHighDateTime;
+  wst->LowPart  = ft.dwLowDateTime;
 #endif
 }
 
-/* Return the number of milliseconds elapsed since the timer was last
-   reset.  It is allowed to call this function more than once to get
-   increasingly higher elapsed values.  */
+/* Reset timer WT.  This establishes the starting point from which
+   wtimer_elapsed() will return the number of elapsed
+   milliseconds.  It is allowed to reset a previously used timer.  */
 
-long
-wtimer_elapsed (struct wget_timer *wt)
+void
+wtimer_reset (struct wget_timer *wt)
 {
+  /* Set the start time to the current time. */
+  wtimer_sys_set (&wt->start);
+  wt->elapsed_last = 0;
+  wt->elapsed_pre_start = 0;
+}
+
+static long
+wtimer_sys_diff (wget_sys_time *wst1, wget_sys_time *wst2)
+{
 #ifdef TIMER_GETTIMEOFDAY
-  struct timeval t;
-  gettimeofday (&t, NULL);
-  return (t.tv_sec - wt->secs) * 1000 + (t.tv_usec - wt->usecs) / 1000;
+  return ((wst1->tv_sec - wst2->tv_sec) * 1000
+          + (wst1->tv_usec - wst2->tv_usec) / 1000);
 #endif
 
 #ifdef TIMER_TIME
-  time_t now = time (NULL);
-  return 1000 * (now - wt->secs);
+  return 1000 * (*wst1 - *wst2);
 #endif
 
 #ifdef WINDOWS
-  FILETIME ft;
-  SYSTEMTIME st;
-  ULARGE_INTEGER uli;
-  GetSystemTime (&st);
-  SystemTimeToFileTime (&st, &ft);
-  uli.HighPart = ft.dwHighDateTime;
-  uli.LowPart = ft.dwLowDateTime;
-  return (long)((uli.QuadPart - wt->wintime.QuadPart) / 10000);
+  return (long)(wst1->QuadPart - wst2->QuadPart) / 10000;
 #endif
+}
+
+/* Return the number of milliseconds
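
To summarize the monotonicity logic described in the ChangeLog entry
above, a simplified sketch (using the helpers introduced by this patch,
but not the literal committed code) would be:

/* Sketch of the monotonicity guarantee: if the system clock was set
   back, restart measuring from the current time and keep the
   previously reported value as an offset, so the result never
   decreases.  Simplified illustration, not the committed code.  */
long
wtimer_elapsed (struct wget_timer *wt)
{
  wget_sys_time now;
  long elapsed;

  wtimer_sys_set (&now);
  elapsed = wt->elapsed_pre_start + wtimer_sys_diff (&now, &wt->start);

  if (elapsed < wt->elapsed_last)
    {
      /* Clock went backwards: reset the start time, remember what we
         already reported, and return the previous value.  */
      wt->start = now;
      wt->elapsed_pre_start = wt->elapsed_last;
      elapsed = wt->elapsed_last;
    }

  wt->elapsed_last = elapsed;
  return elapsed;
}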

More flexible URL file name generation

2003-09-14 Thread Hrvoje Niksic
This patch makes URL file name generation a bit more flexible and,
hopefully, better for the end-user.  It does two things:

* Decouples file name quoting from URL quoting.  The conflation of the
  two has been an endless source of annoyance for users.  For example,
  space *has* to be quoted in URLs, but you don't really want to quote
  it in file names.

* Gives the user more control over the quoting mechanism.  There are
  now several quoting levels:

--restrict-file-names=none  - no restriction, only quote / and \0

--restrict-file-names=unix  - quote the above, plus chars in the
  0-31 and in the 128-159 range, which
  are not printable in the shell.

--restrict-file-names=windows - quote the above, plus chars
  disallowed on Windows: \, |, <, >, ?, :, *, and ".

  The default is "windows" under Windows and Cygwin, and "unix" elsewhere.

This patch should supersede the various patches that have been
floating around that fix the problem in a limited fashion.  Please
test this patch and let me know if it works for you, and if something
else is needed.
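
To illustrate what the quoting levels boil down to, here is a rough
sketch of the per-character test and the %XX replacement.  It is a
simplified stand-in, not the lookup-table code from url.c:

#include <stdio.h>
#include <string.h>

enum restrict_mode { restrict_none, restrict_unix, restrict_windows };

/* Simplified stand-in for Wget's lookup table: is C unsafe in a local
   file name under the given restriction mode?  */
static int
unsafe_in_file_name (unsigned char c, enum restrict_mode mode)
{
  if (c == '\0' || c == '/')
    return 1;                             /* always quoted */
  if (mode == restrict_none)
    return 0;
  if (c < 32 || (c >= 128 && c < 160))
    return 1;                             /* unprintable in the shell */
  if (mode == restrict_windows)
    return strchr ("\\|<>?:*\"", c) != NULL;
  return 0;
}

/* Copy SRC to DST, replacing unsafe characters with %XX escapes.
   DST must be large enough (up to 3x strlen(SRC) + 1 bytes).  */
static void
escape_file_name (const char *src, char *dst, enum restrict_mode mode)
{
  for (; *src; src++)
    if (unsafe_in_file_name ((unsigned char) *src, mode))
      dst += sprintf (dst, "%%%02X", (unsigned char) *src);
    else
      *dst++ = *src;
  *dst = '\0';
}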


2003-09-14  Hrvoje Niksic  [EMAIL PROTECTED]

* url.c (append_uri_pathel): Use opt.restrict_file_names when
calling file_unsafe_char.

* init.c: New command restrict_file_names.

* main.c (main): New option --restrict-file-names[=windows,unix].

* url.c (url_file_name): Renamed from url_filename.
(url_file_name): Add directory and hostdir prefix here, not in
mkstruct.
(append_dir_structure): New function, does part of the work that
used to be in mkstruct.  Iterates over path elements in u-path,
calling append_uri_pathel on each one to append it to the file
name.
(append_uri_pathel): URL-unescape a path element and reencode it
with a different set of rules, more appropriate for handling of
files.
(file_unsafe_char): New function, uses a lookup table to decide
whether a character should be escaped for use in file name.
(append_string): New utility function.
(append_char): Ditto.
(file_unsafe_char): New argument restrict_for_windows, decide
whether Windows file names should be escaped in run-time.

* connect.c: Include stdlib.h to get prototype for abort().

Index: NEWS
===
RCS file: /pack/anoncvs/wget/NEWS,v
retrieving revision 1.38
diff -u -r1.38 NEWS
--- NEWS2003/09/10 20:21:13 1.38
+++ NEWS2003/09/14 21:45:48
@@ -7,8 +7,6 @@
 
 * Changes in Wget 1.9.
 
-** The build process now requires Autoconf 2.5x.
-
 ** It is now possible to specify that POST method be used for HTTP
 requests.  For example, `wget --post-data="id=foo&data=bar" URL' will
 send a POST request with the specified contents.
@@ -32,6 +30,15 @@
 
 ** The new option `--dns-cache=off' may be used to prevent Wget from
 caching DNS lookups.
+
+** The build process now requires Autoconf 2.5x.
+
+** Wget no longer quotes characters in local file names that would be
+considered unsafe as part of URL.  Quoting can still occur for
+control characters or for '/', but no longer for frequent characters
+such as space.  You can use the new option --restrict-file-names to
+enforce even stricter rules, which is useful when downloading to
+Windows partitions.
 
 * Wget 1.8.1 is a bugfix release with no user-visible changes.
 
Index: doc/wget.texi
===
RCS file: /pack/anoncvs/wget/doc/wget.texi,v
retrieving revision 1.68
diff -u -r1.68 wget.texi
--- doc/wget.texi   2003/09/10 19:41:50 1.68
+++ doc/wget.texi   2003/09/14 21:46:10
@@ -800,6 +800,39 @@
 
 If you don't understand the above description, you probably won't need
 this option.
+
+@cindex file names, restrict
+@cindex Windows file names
+@item --restrict-file-names=none|unix|windows
+Restrict characters that may occur in local file names created by Wget
+from remote URLs.  Characters that are considered @dfn{unsafe} under a
+set of restrictions are escaped, i.e. replaced with @samp{%XX}, where
+@samp{XX} is the hexadecimal code of the character.
+
+The default for this option depends on the operating system: on Unix and
+Unix-like OS'es, it defaults to ``unix''.  Under Windows and Cygwin, it
+defaults to ``windows''.  Changing the default is useful when you are
+using a non-native partition, e.g. when downloading files to a Windows
+partition mounted from Linux, or when using NFS-mounted or SMB-mounted
+Windows drives.
+
+When set to ``none'', the only characters that are quoted are those that
+are impossible to get into a file name---the NUL character and @samp{/}.
+The control characters, newline, etc. are all placed into file names.
+
+When set to ``unix

Re: wget proxy support

2003-09-14 Thread Hrvoje Niksic
Nicolas, thanks for the patch; I'm about to apply it to Wget CVS.



Re: upcoming new wget version

2003-09-15 Thread Hrvoje Niksic
Hrvoje Niksic [EMAIL PROTECTED] writes:

 Jochen Roderburg [EMAIL PROTECTED] writes:

 Question: Has the often discussed *feature* of version 1.8.x, whereby
 special characters in local filenames are url-encoded, been fixed in
 the meantime?

 Hmm, that was another thing scheduled to be fixed for 1.9.

I believe that the feature has now been fixed.  Please try the
latest CVS and let me know what you think.

BTW URL-escaping special chars in file names is not specific to 1.8.x.
All Wget versions until 1.9 have suffered to some extent from file
quoting being coupled with URL quoting.  It became worse in 1.8.x
because it implemented stricter [and more correct] URL escaping rules
-- which happened to be even less appropriate for file names.


Re: small doc update patch

2003-09-15 Thread Hrvoje Niksic
Noèl Köthe [EMAIL PROTECTED] writes:

 On Wed, 2003-09-10 at 22:21, Hrvoje Niksic wrote:

  Just a small patch for the documentation:
 
  --- wget-1.8.2.orig/doc/wget.texi
  +++ wget-1.8.2/doc/wget.texi
  @@ -507,7 +507,7 @@
   @item -t @var{number}
   @itemx --tries=@var{number}
   Set number of retries to @var{number}.  Specify 0 or @samp{inf} for
  -infinite retrying.
  +infinite retrying. Default (no command-line switch) is not to
  retry.
 
 Huh?  The default is to retry 20 times.  Isn't it?  :-)

 Hmm, then i got it wrong:

 $ LC_ALL=C wget -t 0 http://localhost/asdf
 --00:22:01--  http://localhost/asdf
= `asdf'
 Resolving localhost... done.
 Connecting to localhost[127.0.0.1]:80... failed: Connection refused.

It doesn't retry after fatal errors, such as "connection refused", and
--tries doesn't change that.

The flag --retry-connrefused, new in CVS, tells Wget to treat
connection refused as a non-fatal error.



Re: windows compile error

2003-09-16 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 Just a quick note, the current cvs code on windows during compile (with
 VC++6) stops with

 cl /I. /DWINDOWS /D_CONSOLE /DHAVE_CONFIG_H /DSYSTEM_WGETRC=\wgetrc\
 /DHAVE_SSL /nologo /MT /W0 /O2 /c utils.c
 utils.c
 utils.c(1651) : error C2520: conversion from unsigned __int64 to double not
 implemented, use signed __int64

 The culprit seems to be (in wtimer_sys_diff)

 #ifdef WINDOWS
   return (double)(wst1->QuadPart - wst2->QuadPart) / 10000;
 #endif

Does this patch help?

2003-09-16  Hrvoje Niksic  [EMAIL PROTECTED]
 
* utils.c (wtimer_sys_diff): Convert the time difference to signed
__int64, then to double.  This works around MS VC++ 6 which can't
convert unsigned __int64 to double directly.

Index: src/utils.c
===
RCS file: /pack/anoncvs/wget/src/utils.c,v
retrieving revision 1.54
diff -u -r1.54 utils.c
--- src/utils.c 2003/09/15 21:14:15 1.54
+++ src/utils.c 2003/09/16 21:01:02
@@ -1648,7 +1648,10 @@
 #endif
 
 #ifdef WINDOWS
-  return (double)(wst1->QuadPart - wst2->QuadPart) / 10000;
+  /* VC++ 6 doesn't support direct cast of uint64 to double.  To work
+ around this, we subtract, then convert to signed, then finally to
+ double.  */
+  return (double)(signed __int64)(wst1->QuadPart - wst2->QuadPart) / 10000;
 #endif
 }
 



Re: bug in wget 1.8.1/1.8.2

2003-09-16 Thread Hrvoje Niksic
Dieter Drossmann [EMAIL PROTECTED] writes:

 I use an extra file with a long list of http entries. I included this
 file with the -i option.  After 154 downloads I got an error
 message: Segmentation fault.

 With wget 1.7.1 everything works well.

 Is there a new limit of lines?

No, there's no built-in line limit, what you're seeing is a bug.

I cannot see anything wrong inspecting the code, so you'll have to
help by providing a gdb backtrace.  You can get it by doing this:

* Compile Wget with `-g' by running `make CFLAGS=-g' in its source
  directory (after configure, of course.)

* Go to the src/ directory and run that version of Wget the same way
  you normally run it, e.g. ./wget -i FILE.

* When Wget crashes, run `gdb wget core', type `bt' and mail us the
  resulting stack trace.

Thanks for the report.



Re: Incomplete man page on wget

2003-09-16 Thread Hrvoje Niksic
Mitra [EMAIL PROTECTED] writes:

 Hi,

 Thanks for the response.

   I've never used Info before, except for documentation of emacs and
   very few things are documented there. I suggest it should be
   presumed that people will look at man wget or wget --help and
   make sure the documentation is either the same, or that there is a
   level of indirection to info wget

You are right.  The current man page does not seem to mention that it
is only an excerpt from the entire documentation, and that is a bug.

As for Info, note that Wget is a GNU program, and Info is the
preferred documentation format of the GNU project.



Re: windows compile error

2003-09-17 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 Does compile now, but I managed to produce an application error during a
 test run on an https site.

 I produced a debug build with /DDEBUG /Zi /Od /Fd /FR and produced the
 wget.bsc by running bscmake on all the sbr files, but I didn't yet
 understand how to use that one in VC++ in order to get a meaningful stack
 trace and so on.
 The only thing I got for now is "SSLEAY32! 0023ca38()" as the breaking
 point.

It sounds like an https thing.

Is the error repeatable?  If so, can you repeat it with an earlier CVS
snapshot?



Re: Small change to print SSL version

2003-09-17 Thread Hrvoje Niksic
Christopher G. Lewis [EMAIL PROTECTED] writes:

 Here's a small change to print out the OpenSSL version with the -V and
 --help parameters.
[...]

I think that "GNU Wget <something>" should always stand for Wget's
version, regardless of the libraries it has been compiled with.  But
if you want to see the version of libraries, why not make it clearer,
e.g.:

GNU Wget x.x.x (compiled with OpenSSL x.x.x)

BTW can't you find out OpenSSL version by using `ldd'?
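
For what it's worth, a banner along the lines suggested above could be
produced with something like the following sketch; the helper name is
made up, and OPENSSL_VERSION_TEXT comes from <openssl/opensslv.h>:

#include <stdio.h>
#ifdef HAVE_SSL
# include <openssl/opensslv.h>   /* defines OPENSSL_VERSION_TEXT */
#endif

/* Sketch: print a version banner that keeps "GNU Wget <version>" first
   and mentions the SSL library it was compiled with.  Made-up helper,
   not existing Wget code.  */
static void
print_version_banner (const char *version)
{
#ifdef HAVE_SSL
  printf ("GNU Wget %s (compiled with %s)\n", version, OPENSSL_VERSION_TEXT);
#else
  printf ("GNU Wget %s\n", version);
#endif
}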



Re: Handling of Content-Length 0

2003-09-17 Thread Hrvoje Niksic
Stefan Eissing [EMAIL PROTECTED] writes:

 Of course this is only noticeable with HTTP/1.1 servers which leave
 the connection open and do not apply Transfer-Encoding: chunked for
 empty response bodies.

They may not apply chunked transfer because Wget doesn't know how to
handle it.  And leaving the connections open is also Wget's bug
because it explicitly requests it.

 I imagine this should be quite easy to fix...

Yes.  Patch following RSN.



Re: small doc update patch

2003-09-17 Thread Hrvoje Niksic
Noèl Köthe [EMAIL PROTECTED] writes:

 -infinite retrying.
 +infinite retrying. Default (no command-line switch) is to retry
 +20 times but fatal errors like connection refused or not found
 +(404) are not being retried.

Thanks.  I've now committed this:

Index: doc/wget.texi
===
RCS file: /pack/anoncvs/wget/doc/wget.texi,v
retrieving revision 1.75
retrieving revision 1.77
diff -u -r1.75 -r1.77
--- doc/wget.texi   2003/09/17 01:32:02 1.75
+++ doc/wget.texi   2003/09/17 21:00:03 1.77
@@ -512,7 +512,9 @@
 @item -t @var{number}
 @itemx --tries=@var{number}
 Set number of retries to @var{number}.  Specify 0 or @samp{inf} for
-infinite retrying.
+infinite retrying.  The default is to retry 20 times, with the exception
+of fatal errors like ``connection refused'' or ``not found'' (404),
+which are not retried.
 
 @item -O @var{file}
 @itemx --output-document=@var{file}



Re: windows compile error

2003-09-17 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 Repeatable, and it seems to appear with this:

 2003-09-15  Hrvoje Niksic  [EMAIL PROTECTED]

   * retr.c (get_contents): Reduce the buffer size to the amount of
   data that may pass through for one second.  This prevents long
   sleeps when limiting bandwidth.

   * connect.c (connect_to_one): Reduce the socket's RCVBUF when
   bandwidth limitation to small values is requested.

 Previous checkout (checkout -D 23:30 15 sep 2003) wget works fine.
 I also found a public site which seems to expose the problem (at least from
 my machine):
 wget -dv https://www.shavlik.com/pHome.aspx
 dies after
 DEBUG output created by Wget 1.9-beta on Windows.
[...]

Herold, I'm currently having problems obtaining a working SSL build,
so I'll need your help with this.

Notice that the above change in fact consists of two changes: one to
`retr.c', and the other to `connect.c'.  Please try to figure out
which one is responsible for the crash.  Then we'll have a better idea
of what to look for.



Re: Handling of Content-Length 0

2003-09-17 Thread Hrvoje Niksic
Stefan Eissing [EMAIL PROTECTED] writes:

 Please excuse if this bug has already been reported:

 In wget 1.8.1 (OS X) and 1.8.2 (cygwin) the handling of resources
 with content-length 0 is wrong. wget tries to read the empty content
 and hangs until the socket read timeout fires. (I set the timeout to
 different values and it exactly matches the termination of the GET).

 Of course this is only noticeable with HTTP/1.1 servers which leave the
 connection open and do not apply Transfer-Encoding: chunked for empty
 response bodies.

I've now examined the source code, and I believe Wget handles this
case correctly: if keep-alive is in use, it reads only as much data as
specified by Content-Length, attempting no read if content-length is
0.
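
In other words, with keep-alive the read loop is bounded by the
announced Content-Length, along the lines of this simplified sketch
(not the actual get_contents() code; SSL and most error handling are
omitted):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: read exactly CONTLEN bytes of response body from FD and
   write them to OUT.  A Content-Length of 0 results in no read at
   all, so an empty body cannot hang the download.  */
static int
read_body_keepalive (int fd, FILE *out, long contlen)
{
  char buf[8192];
  long remaining = contlen;

  while (remaining > 0)
    {
      long to_read = remaining < (long) sizeof (buf)
                     ? remaining : (long) sizeof (buf);
      ssize_t nread = read (fd, buf, to_read);
      if (nread <= 0)
        return -1;                /* error or premature EOF */
      fwrite (buf, 1, (size_t) nread, out);
      remaining -= nread;
    }
  return 0;
}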

The one case I can see might go wrong is that a server leaves a
connection hanging without having told Wget it was about to do so.
Then Wget will, being a HTTP/1.0 client, try to read all data from the
socket regardless of Content-Length.

Do you have a URL which you use to repeat this?



Re: minor problem with @xref in documentation

2003-09-18 Thread Hrvoje Niksic
Noèl Köthe [EMAIL PROTECTED] writes:

 at the end of the description of the option --http-passwd=password:

 For more information about security issues with Wget,

 The sentence is incomplete.

 wget.texi shows:

 For more information about security issues with Wget, @xref{Security
 Considerations}.

 The info page has a correct link.

 Any idea how to fix this for the manpage?

Maybe we should hack texi2pod to change @xref{.*} to "see the node \1
of the Info documentation"?


Re: windows compile error

2003-09-18 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 Found it.
 Using the 23:00 connect.c and the 23:59 retr.c does produce the bug.
 Using the 23:59 connect.c and the 23:00 retr.c works ok.
 This means the problem must be in retr.c .

OK, that narrows it down.  Two further questions:

1) If you comment out lines 180 and 181 of retr.c, does the problem go
   away?

1a) How about if you replace line 181 with `dlbufsize = sizeof(dlbuf)'?

2) Do you even specify --limit-rate?  If so, to what size?


Re: windows compile error

2003-09-18 Thread Hrvoje Niksic
I've noticed the mistake as soon as I compiled with SSL (and saw the
warnings):

2003-09-18  Hrvoje Niksic  [EMAIL PROTECTED]

* retr.c (get_contents): Pass the correct argument to ssl_iread.

Index: src/retr.c
===
RCS file: /pack/anoncvs/wget/src/retr.c,v
retrieving revision 1.57
diff -u -r1.57 retr.c
--- src/retr.c  2003/09/15 21:48:43 1.57
+++ src/retr.c  2003/09/18 11:41:56
@@ -191,7 +191,7 @@
? MIN (expected - *len, dlbufsize) : dlbufsize);
 #ifdef HAVE_SSL
   if (rbuf->ssl!=NULL)
-   res = ssl_iread (rbuf->ssl, dlbufsize, amount_to_read);
+   res = ssl_iread (rbuf->ssl, dlbuf, amount_to_read);
   else
 #endif /* HAVE_SSL */
res = iread (fd, dlbuf, amount_to_read);




Re: protocols directories ?

2003-09-18 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 Solution 1: have a switch like --use-protocol-dir = [no|most|all]

 no would be the current state:
 ./www.some.site/index.html
 ./www.some.site/index.html
 ./www.some.site/index.html

 all would be: always add a directory level for the protocol:
 ./http/www.some.site/index.html
 ./https/www.some.site/index.html
 ./ftp/www.some.site/index.html

That sounds like a good suggestion, except I'd personally go for a
simple yes/no.  People who don't need it will never use it, and people
who do need it won't mind the "all" semantics (I think).

Plus, *plug*, in the new code, it's dead easy to add.  For example,
in url_file_name, url.c:1691, you could write:

if (opt.add_protocol_dir)
  append_string (scheme_name (u->scheme), &fnres);

Implementation of scheme_name is left as an exercise to the reader.
:-)
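
To spoil the exercise a little, a hypothetical scheme_name could be as
simple as the sketch below; the scheme constants are meant to mirror
the ones in url.h, but treat the details as assumptions rather than
existing code:

/* Hypothetical helper: return a printable name for a URL scheme, for
   use as the extra directory component discussed above.  Sketch only,
   not existing Wget code.  */
static const char *
scheme_name (enum url_scheme scheme)
{
  switch (scheme)
    {
    case SCHEME_HTTP:
      return "http";
#ifdef HAVE_SSL
    case SCHEME_HTTPS:
      return "https";
#endif
    case SCHEME_FTP:
      return "ftp";
    default:
      return "unknown";          /* shouldn't happen for parsed URLs */
    }
}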


Re: non-recursion

2003-09-18 Thread Hrvoje Niksic
Ilya N. Golubev [EMAIL PROTECTED] writes:

 Duplicating my [EMAIL PROTECTED] sent on Wed, 10 Sep 2003
 19:48:56 +0400 since mailer reports that [EMAIL PROTECTED] does not
 work.

 wget -mLd http://www.hro.org/docs/rlex/uk/index.htm

 does not follow `<A HREF=uk1.htm#1>' links contained in the
 resource.

That's because Wget thinks those links are part of a huge comment that
spans the better part of the document.  Unlike most browsers, Wget
implements a (too) strict comment parsing, which breaks pages that use
non-SGML-compliant comments.

As http://www.htmlhelp.com/reference/wilbur/misc/comment.html explains:

    [...] There is also the problem with the "--" sequence. Some
    people have a habit of using things like <!------------> as
    separators in their source. Unfortunately, in most cases, the
    number of "-" characters is not a multiple of four. This means
    that a browser who tries to get it right will actually get it
    wrong here and actually hide the rest of the document.

Currently the only workaround is to alter the source, e.g. by
modifying advance_declaration() in html-parse.c.  A future version of
Wget will probably parse comments in a non-compliant fashion, by
considering everything between <!-- and --> to be a comment, which is
what most other browsers have been doing since the beginnings of the
web.
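
For comparison, the non-compliant ("minimal") behavior amounts to
scanning for the first "-->", roughly like this sketch (an
illustration, not the html-parse.c code):

/* Sketch of "minimal" comment parsing: P points just past "<!--";
   return the position right after the terminating "-->", or END if
   the comment never terminates.  Illustration only.  */
static const char *
skip_minimal_comment (const char *p, const char *end)
{
  for (; p + 2 < end; p++)
    if (p[0] == '-' && p[1] == '-' && p[2] == '>')
      return p + 3;
  return end;
}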


Re: wget renaming URL/file downloaded, how to???

2003-09-18 Thread Hrvoje Niksic
Lucuk, Pete [EMAIL PROTECTED] writes:

 as we can see above, wget has raznoe.shtml.html as the main file,
 this is *not* what I want; I *always* want the main file to be named
 index.html.

Wget doesn't really have the concept of a main file.  As a
workaround, you could simply `ln -s raznoe.shtml.html index.html', and
index.html would point to your main file and be available on the web.


Re: non-recursion

2003-09-19 Thread Hrvoje Niksic
Doug Kaufman [EMAIL PROTECTED] writes:

 On Thu, 18 Sep 2003, Hrvoje Niksic wrote:

 modifying advance_declaration() in html-parse.c.  A future version of
 Wget will probably parse comments in a non-compliant fashion, by
 considering everything between <!-- and --> to be a comment, which is
 what most other browsers have been doing since the beginnings of the
 web.

 The lynx browser is configurable as to how it parses comments.

So is Wget, as of last night.  The default is minimal (non-compliant)
comment parsing, and that can be changed with `--strict-comments'.

 It can change on the fly from minimal comments to historical
 comments to valid comments. Which browsers act in non-compliant
 fashion all the time?

Those that display http://www.hro.org/docs/rlex/uk/index.htm (unless
I'm mistaken), and that would mean pretty much all of them.  Of
course, that page is but one example out of many.

Some browsers have more complex heuristics for comment parsing, but
adding that to Wget would probably be overdoing it.


Re: Read error (Success) in headers using wget and ssl

2003-09-19 Thread Hrvoje Niksic
Dimitri Ars [EMAIL PROTECTED] writes:

 I'm having trouble connecting with wget to a site using SSL:
[...]

I can repeat this, but currently I don't understand enough about SSL
to fix it.  Christian, could you please help?

 wget https://145.222.135.165/index.htm
 --13:46:36--  https://145.222.135.165/index.htm
= `index.htm'
 Connecting to 145.222.135.165:443... connected.
 HTTP request sent, awaiting response...
 Read error (Success) in headers.
 Retrying.

 --13:46:37--  https://145.222.135.165/index.htm
   (try: 2) = `index.htm'
 Connecting to 145.222.135.165:443... connected.
 HTTP request sent, awaiting response...
 Read error (Success) in headers.
 Retrying.
 ---

 Expected:
 Unable to establish SSL connection.
 because it's using client certificates, but when using the client
 certificate the same error occurs, so this doesn't seem to be a
 client certificate problem, though it might be that wget is having trouble
 checking that it does need a client certificate ?!

 Of course, using IE as a browser (and the client certificate), no problem...

 Any idea how to fix this ?
 I used wget 1.8.2 and a nightly cvs of 20030909, same problem
 (Please reply directly too as I'm not on the list)

 Best regards,

 Dimitri


Re: Any comments on my feature requests ?

2003-09-21 Thread Hrvoje Niksic
Sorry about the lack of response.  Your feature requests are quite
reasonable, but I have no idea of the timeframe when I'll work on them
(they're not a priority for me).  Perhaps someone else is interested
in helping implement them.

The things I planned to tackle for a post-1.9 release are compression
support and a proper password manager.

BTW, have you tried `--http-user' and `--http-passwd'?  They're
supposed to do pretty much what you describe.



Re: Any comments on my feature requests ?

2003-09-21 Thread Hrvoje Niksic
Mark Veltzer [EMAIL PROTECTED] writes:

 On Monday 22 September 2003 00:20, you wrote:
 Sorry about the lack of response.  Your feature requests are quite
 reasonable, but I have no idea of the timeframe when I'll work on
 them (they're not a priority for me).  Perhaps someone else is
 interested in helping implement them.

 The things I planned to tackle for a post-1.9 release are
 compression support and proper password manager.

 BTW, have you tried `--http-user' and `--http-passwd'?  They're
 supposed to do pretty much what you describe.

 That's weird. I tried --http-user and --http-passwd and all is
 working well.  According to the documentation the following are
 equivalent:

   wget -r --http-user=foo --http-passwd=bar http://my.org

 and

   wget -r http://foo:[EMAIL PROTECTED]

 But they are not. Version 1 works while version 2 doesn't ?!?

Does the manual really say that they are equivalent?

When you specify `--http-user' and `--http-passwd', they are used for
*all* the downloads.  When you specify the username and password in a
URL, they are used for that URL and not others.  That can be
considered a bug, but that's how it is.



Re: Any comments on my feature requests ?

2003-09-21 Thread Hrvoje Niksic
Mark Veltzer [EMAIL PROTECTED] writes:

 In addition I would add a flag that makes the URL method work like
 the explicit method and vice versa. This would cover all bases.

The semantics of that flag aren't as obvious as it may seem.  For
example, it's completely legal to do this:

wget -r http://user1:[EMAIL PROTECTED]/foo/ http://user2:[EMAIL PROTECTED]/bar/



Portable representation of large integers

2003-09-22 Thread Hrvoje Niksic
In these enlightened times, when files of 2G and larger are no longer
considered large even in the third world, more and more people ask for
the ability to download huge files with Wget.

Wget carefully uses `long' for potentially large values, such as
file sizes and offsets, but that has no effect on the most popular
32-bit architectures, where `long' and `int' are both 32-bit
quantities.  (It does help on 16-bit architectures where `int' is
16-bit, and it helps under 64-bit LP64 environments where int is
32-bit, but `long' and `long long' are 64-bit.)

There have been several attempts to fix this:

* The hack called VERY_LONG_TYPE is used to store values that can be
  reasonably larger than 2G, such as the sum of all downloads.
  However, on machines without `long long', VERY_LONG_TYPE will be
  long.  Since it is not used for anything critical, that's not much
  of a problem (and Wget is careful to detect overflows when adding to
  the sum, so bogus values are not printed.)

* SuSE incorporated patches that change Wget's use of `long' to
  `unsigned long', which upgraded the limit from 2G to 4G.  Aside from
  all the awkwardness that comes from unsigned arithmetic (checking
  for error conditions with x < 0 doesn't work; you have to use x == -1),
  its effect is limited: if I want to download a 3G file today, I'll
  want to download a 5G file tomorrow.

* In its own patches, Debian introduced the use of large file APIs and
  `long long'.  While that's perfectly fine for Debian, it is not
  portable.  Neither the large file API nor `long long' are
  universally available, and both need thorough configure checking.

I believe that large numbers and large files are orthogonal.  We need
a large numeric type to represent numbers that *could* be large, be it
the sum of downloaded bytes, remote file sizes, or local file sizes or
offsets.  Independently, we need to use large file API where
available, to be able to write and read large files locally.

Of those two issues, choosing and using the numeric type is the hard
one.  Autoconf helps only to an extent -- even if you define your own
`large_number_t' typedef, which is either `long' or `long long', the
question remains how to print that number.  Even worse, some systems
have `long long' (because they use gcc), but don't support it in libc,
so printf can't print it.

One way to solve this is to define macros for printing types.  For
example:

#ifdef HAVE_LONG_LONG
  typedef long long large_number_t;
# define LN_PRINT "lld"
#else
  typedef double large_number_t;
# define LN_PRINT "f"
#endif

Then this becomes legal code:

large_number_t num = 0;
printf ("The number is: %" LN_PRINT "!\n", num);

Aside from being butt-ugly, this code has two serious problems.

1. Concatenation of adjacent string literals is an ANSI feature and
   would break pre-ANSI compilers.

2. It breaks gettext.  With translation support, the above code would
   look like this:

 large_number_t num = 0;
 printf (_("The number is: %" LN_PRINT "!\n"), num);

   The message snarfer won't be able to process this because it
   expects a string literal inside _(...).  Even if it were taught
   about string concatenation, it wouldn't know what to replace
   LN_PRINT with, unless it ran the preprocessor.  And if it ran the
   preprocessor, it would get non-portable results ("lld" or "f") which
   cannot be stored to the message catalog.

The bottom line is, I really don't know how to solve this portably.
Does anyone know how widely ported software deals with large files?


Re: Portable representation of large integers

2003-09-22 Thread Hrvoje Niksic
Maciej W. Rozycki [EMAIL PROTECTED] writes:

 On Mon, 22 Sep 2003, Hrvoje Niksic wrote:

  Well, using off_t and AC_SYS_LARGEFILE seems to be the recommended
  practice.
 
 Recommended for POSIX systems, perhaps, but not really portable to
 older machines.  And it doesn't solve the portable printing problem
 either, so in effect it's about as portable as unconditionally using
 `long long', which is mandated by C99.

 I doubt any system that does not support off_t does support LFS.

As I mentioned in the first message, LFS is not the only thing you
need large values for.  Think download quota or the sum of downloaded
bytes.  You should be able to specify `--quota=10G' on systems without
LFS.

As for the hassle, remember that Wget caters to systems with much less
features than LFS on a regular basis.  For example, we suppose
pre-ANSI C compilers, libc's without snprintf, strptime or, for that
matter, basic C89 functions like memcpy or strstr.  So yes, I'd say
pre-LFS systems are worth the hassle.

Perhaps a good compromise would be to use off_t for variables whose
64-bitness doesn't matter without LFS, and a `large_number_t' typedef
that points to either `double' or `long long' for others.  Since the
others are quite rare, printing them won't be a problem in practice,
just like it's not for VERY_LONG_TYPE right now.
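
A tiny sketch of what that compromise could look like (the variable
names and the HAVE_LONG_LONG test are illustrative assumptions, not
actual Wget declarations):

/* Sketch only -- not Wget's actual code.  */
#include <sys/types.h>          /* off_t */

/* Offsets and sizes of local files: off_t, which AC_SYS_LARGEFILE can
   widen on systems that provide LFS.  */
off_t restart_offset;

/* Values that merely need to be numerically large (quota, total byte
   count): a wide type that exists even without LFS.  */
#ifdef HAVE_LONG_LONG
typedef long long large_number_t;
#else
typedef double large_number_t;  /* 53-bit mantissa; plenty for byte counts */
#endif

large_number_t total_downloaded_bytes;
large_number_t download_quota;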

 And even if it does, it's probably not worth the hassle.  To handle
 ordinary old systems, you just call:

 AC_CHECK_TYPE(off_t, long)

 before calling AC_SYS_LARGEFILE.

That still doesn't explain how to print off_t on systems that don't
natively support it.  (Or that do, for that matter.)



Re: Please remove me from this alias

2003-09-22 Thread Hrvoje Niksic
Note that this is not an alias, it's a mailing list you must have
subscribed to before.  (We're not in the spam business just yet,
despite certain unfortunate events in the past.)  To unsubscribe,
please send mail to [EMAIL PROTECTED].



Re: Portable representation of large integers

2003-09-22 Thread Hrvoje Niksic
Daniel Stenberg [EMAIL PROTECTED] writes:

 On Mon, 22 Sep 2003, Hrvoje Niksic wrote:

 The bottom line is, I really don't know how to solve this
 portably. Does anyone know how widely ported software deals with
 large files?

 In curl, we provide our own *printf() code that works as expected on
 all platforms.

Lovely.  :-)  Wget does come with a printf implementation, but it's
used only on systems that don't have snprintf, and I'd kind of like
it to stay that way.  This is one of those wheels that are not that
much fun to reinvent.  (But then again, I thought exactly the same
about hash tables, and I ended up having to roll my own.)

 (Not that we have proper 2GB support yet anyway, but that's another
 story.  For example, we have to face the problems with exposing an
 API using such a variable type...)

Ah, they joys of writing a library...



Re: Portable representation of large integers

2003-09-23 Thread Hrvoje Niksic
DervishD [EMAIL PROTECTED] writes:

 Yes, you're right, but... How about using C99 large integer types
 (intmax_t and family)?

But then I can use `long long' just as well, which is supported by C99
and (I think) required to be at least 64 bits wide.  Portability is
the whole problem, so suggestions that throw portability out the
window aren't telling me anything new.

Using #ifdefs to switch between %d/%lld/%j *is* completely portable,
but it requires three translations for each message.  The translators
would feast on my flesh, howling at the moonlight.

Hmm.  How about preprocessing the formats before passing them to
printf?  For example, always use %j in strings, like this:

printf (FORMAT (_("whatever %j\n")), num);

On systems that support %j, FORMAT would be defined to no-op.
Otherwise, it would be defined to a format_transform function that
converts %j to either %lld or %.0f, depending on whether the system
has long long or not (in which case it would use double for large
quantities).
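
A minimal sketch of such a FORMAT/format_transform pair under the %j
convention described above (the HAVE_PRINTF_J symbol, the static
buffer and its size are illustrative assumptions, not an actual
implementation):

#include <stdio.h>

/* Sketch only: rewrite "%j" in FMT into a directive the local printf
   understands.  A static buffer and a single pass keep it simple.  */
static const char *
format_transform (const char *fmt)
{
  static char buf[256];
  const char *p = fmt;
  char *q = buf;

  while (*p && q < buf + sizeof buf - 8)
    {
      if (p[0] == '%' && p[1] == 'j')
        {
#ifdef HAVE_LONG_LONG
          q += sprintf (q, "%%lld");  /* large values held in long long */
#else
          q += sprintf (q, "%%.0f");  /* large values held in double */
#endif
          p += 2;
        }
      else
        *q++ = *p++;
    }
  *q = '\0';
  return buf;
}

#ifdef HAVE_PRINTF_J            /* hypothetical configure check */
# define FORMAT(fmt) (fmt)
#else
# define FORMAT(fmt) format_transform (fmt)
#endif

The gettext snarfer still sees a plain string literal inside _(...),
which is the main point of the exercise.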

 That's the best I can get, because when I write portable code, by
 portable I understand 'according to standards'. For me that means,
 in that order: SuSv3, POSIX, C99, C89, stop. No pre-ANSI and no
 brain damaged compilers.

I understand your position -- it's perfectly valid, especially when
you have the privilege of working on a system that supports all those
standards well.  But many people don't, and Wget (along with most GNU
software of the era) was written to work for them as well.  I don't
want to support only POSIX systems for the same reason I don't want to
support only the GNU system or only the Microsoft systems.  For me,
portability is not about adhering to standards, it's about making
programs work in a wide range of environments, some of which differ
from yours.

Thanks for your suggestions.


Re: bug maybe?

2003-09-23 Thread Hrvoje Niksic
Randy Paries [EMAIL PROTECTED] writes:

 Not sure if this is a bug or not.

I guess it could be called a bug, although it's no simple oversight.
Wget currently doesn't support large files.



Wget 1.9-beta1 is available for testing

2003-09-23 Thread Hrvoje Niksic
After a lot of time of sitting in CVS, a beta of Wget 1.9 is
available.  To see what's new since 1.8, check the `NEWS' file in the
distribution.  Get it from:

http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta1.tar.gz

Please test it on as many different platforms as possible and in the
places where Wget 1.8.x is currently being used.  I expect this
release to be extremely stable, but no one can guarantee that without
wider testing.  I didn't want to call it pre1 or rc1 lest I anger
the Gods.

One important addition scheduled for 1.9 and *not* featured in this
beta are Mauro's IPv6 improvements.  When I receive and merge Mauro's
changes, I'll release a new beta.

As always, thanks for your help.



Re: unsubscribe me now otherwise messages will bounce back to you

2003-09-24 Thread Hrvoje Niksic
To unsubscribe, send mail to [EMAIL PROTECTED].


Re: Wget 1.9-beta1 is available for testing

2003-09-24 Thread Hrvoje Niksic
DervishD [EMAIL PROTECTED] writes:

 I've got and tested it, and with NO wgetrc (it happens the same
 with my own wgetrc, but I tested clean just in case), the problem
 with the quoting still exists:

 $wget -r -c -nH ftp://user:[EMAIL PROTECTED]/Music/Joe Hisaishi
[...]
 --15:22:55--  ftp://user:[EMAIL PROTECTED]/Music%2fJoe%20Hisaishi/Joe%20Hisaishi
= `Music%2FJoe Hisaishi/.listing'

Thanks for the detailed bug report.  Although it doesn't look that
way, this problem is nothing but a simple oversight.  (A function that
was supposed to URL-encode everything except slashes failed to enforce
the exception.)  This patch should fix it:

2003-09-24  Hrvoje Niksic  [EMAIL PROTECTED]

* url.c (url_escape_1): Revert unintentional change to lowercase
xdigit escapes.
(url_escape_dir): Document that this function depends on the
output of url_escape_1.

Index: src/url.c
===
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.94
diff -u -r1.94 url.c
--- src/url.c   2003/09/22 12:07:20 1.94
+++ src/url.c   2003/09/24 14:10:48
@@ -198,8 +198,8 @@
{
  unsigned char c = *p1++;
  *p2++ = '%';
- *p2++ = XNUM_TO_digit (c >> 4);
- *p2++ = XNUM_TO_digit (c & 0xf);
+ *p2++ = XNUM_TO_DIGIT (c >> 4);
+ *p2++ = XNUM_TO_DIGIT (c & 0xf);
}
   else
*p2++ = *p1++;
@@ -1130,6 +1130,7 @@
 
   for (; *h; h++, t++)
 {
+  /* url_escape_1 having converted '/' to %2F exactly. */
   if (*h == '%' && h[1] == '2' && h[2] == 'F')
{
  *t = '/';


Re: Wget 1.9-beta1 is available for testing

2003-09-25 Thread Hrvoje Niksic
Could the person who sent me the patch for Windows compilers support
please resend it?  Amidst all the viruses, I accidentally deleted the
message before I've had a chance to apply it.  Sorry about the
mistake.


Re: wget bug

2003-09-26 Thread Hrvoje Niksic
Jack Pavlovsky [EMAIL PROTECTED] writes:

 It's probably a bug: when downloading wget --mirror
 ftp://somehost.org/somepath/3acv14~anivcd.mpg, wget saves it as-is,
 but when downloading wget ftp://somehost.org/somepath/3*, wget saves
 the files as 3acv14%7Eanivcd.mpg

Thanks for the report.  The problem here is that Wget tries to be
helpful by encoding unsafe characters in file names to %XX, as is
done in URLs.  Your first example works because of an oversight (!) 
that actually made Wget behave as you expected.

The good news is that the helpfulness has been rethought for the
next release and is no longer there, at least not for ordinary
characters like "~" and " ".  Try getting the latest CVS sources, they
should work better in this regard.  (http://wget.sunsite.dk/ explains
how to download the source from CVS.)


Re: Windows patches

2003-09-26 Thread Hrvoje Niksic
Thanks for the patch, I've now applied it using the following
ChangeLog entry:

2003-09-26  Gisle Vanem  [EMAIL PROTECTED]

* mswindows.c (read_registry): Removed.
(set_sleep_mode): New function.
(windows_main_junk): Call it.

BTW, unless you want your patch to be reviewed by a wider audience,
you might want to send the patch to [EMAIL PROTECTED] instead.
This, as well as the ChangeLog policy and some other things, is
explained in the PATCHES document at the top level of Wget's
distribution.


Re: dificulty with Debian wget bug 137989 patch

2003-09-30 Thread Hrvoje Niksic
jayme [EMAIL PROTECTED] writes:
[...]

Before anything else, note that the patch originally written for 1.8.2
will need change for 1.9.  The change is not hard to make, but it's
still needed.

The patch didn't make it to canonical sources because it assumes `long
long', which is not available on many platforms that Wget supports.
The issue will likely be addressed in 1.10.

Having said that:

 I tried the patch from Debian bug report 137989 and it didn't work. Can
 anybody explain:
 1 - why I have to make two directories for the patch to work: one
 wget-1.8.2.orig and one wget-1.8.2 ?

You don't.  Just enter Wget's source and type `patch -p1 < patchfile'.
`-p1' makes sure that the top-level directories, such as
wget-1.8.2.orig and wget-1.8.2 are stripped when finding files to
patch.

 2 - why after compilation the wget still can't download the file
 > 2GB ?

I suspect you've tried to apply the patch to Wget 1.9-beta, which
doesn't work, as explained above.



Wget 1.9-beta2 is available for testing

2003-09-30 Thread Hrvoje Niksic
This beta includes several important bug fixes since 1.9-beta1, most
notably the fix for correct file name quoting with recursive FTP
downloads.  Important Windows fixes by Gisle Vanem and Herold Heiko
are also present.

Get it from:

http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta2.tar.gz



Re: Option to save unfollowed links

2003-10-01 Thread Hrvoje Niksic
[ Added Cc to [EMAIL PROTECTED] ]

Tony Lewis [EMAIL PROTECTED] writes:

 The following patch adds a command line option to save any links
 that are not followed by wget. For example:

 wget http://www.mysite.com --mirror --unfollowed-links=mysite.links

 will result in mysite.links containing all URLs that are references
 to other sites in links on mysite.com.

I'm curious: what is the use case for this?  Why would you want to
save the unfollowed links to an external file?


Submitting a `.pot' file to the Translation Project

2003-10-01 Thread Hrvoje Niksic
Does anyone know the current procedure for submitting the `.pot' file
to the GNU Translation Project?  At the moment, the project home page
at http://www.iro.umontreal.ca/contrib/po/HTML/ appears dead.


Re: Option to save unfollowed links

2003-10-01 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:

 I'm curious: what is the use case for this?  Why would you want to
 save the unfollowed links to an external file?

 I use this to determine what other websites a given website refers to.

 For example:
 wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ --mirror -np --unfollowed-links=hayward.out

 By looking at hayward.out, I have a list of all websites that the
 directory refers to. When I use this file, I sort it and throw away
 the Google and DMOZ links. Everything else is supposed to be
 something interesting about Hayward.

I see.  Hmm.. if you have to post-process the list anyway, wouldn't it
be more useful to have a list of *all* encountered URLs?  It might be
nice to accompany this output with the exit statuses, so people can
easily grep for 404's.

A comprehensive reporting facility has often been requested.  Perhaps
something should be done about it for the next release.



Re: Option to save unfollowed links

2003-10-01 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Would something like the following be what you had in mind?

 301 http://www.mysite.com/
 200 http://www.mysite.com/index.html
 200 http://www.mysite.com/followed.html
 401 http://www.mysite.com/needpw.html
 --- http://www.othersite.com/notfollowed.html

Yes, with the possible extensions of file name where the link was
saved, sensible status for non-HTTP (currently FTP) links, etc.



Re: downloading files for ftp

2003-10-01 Thread Hrvoje Niksic
Payal Rathod [EMAIL PROTECTED] writes:

 I have 5-7 user accounts in /home whose data is important. Every day at
 12:00 I want to back up their data to a different backup machine.
 The remote machine has a ftp server.
 Can I use wget for this? If yes, how do I proceed?

The way to do it with Wget would be something like:

wget --mirror --no-host-directories ftp://username:[EMAIL PROTECTED]

It will preserve permissions.  Having said that, I believe that rsync
would be better at this because it's much more careful to correctly
transfer a directory tree from point A to point B.

(For better transfer of file names, you should also use Wget 1.9 beta
and specify `--restrict-file-names=nocontrol'.)



Wget 1.9-beta3 is available for testing

2003-10-01 Thread Hrvoje Niksic
Not many changes from the previous beta.  This is for the purposes of
the Translation Project, to which I've submitted `wget.pot', and which
might wonder where to get the source of a wget-1.9-beta3 from.

Get it from:

http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta3.tar.gz

Mauro's IPv6 changes are not in this beta, and they might not make it
to 1.9.



Re: downloading files for ftp

2003-10-02 Thread Hrvoje Niksic
Payal Rathod [EMAIL PROTECTED] writes:

 On Wed, Oct 01, 2003 at 09:26:47PM +0200, Hrvoje Niksic wrote:
 The way to do it with Wget would be something like:
 
 wget --mirror --no-host-directories ftp://username:[EMAIL PROTECTED]

 But if I run in thru' crontab, where will it store the downloaded files?
 I want it to store as it is in server 1.

It will store them to the current directory.  You can either cd to the
desired target directory, or use the `-P' flag to specify the
directory to Wget.


Re: BUG in --timeout (exit status)

2003-10-02 Thread Hrvoje Niksic
This problem is not specific to timeouts, but to recursive download (-r).

When downloading recursively, Wget expects some of the specified
downloads to fail and does not propagate that failure to the code that
sets the exit status.  This unfortunately includes the first download,
which should probably be an exception.


Re: Submitting a `.pot' file to the Translation Project

2003-10-02 Thread Hrvoje Niksic
The home page is back, but it says that the TP Robot is dead.  I've
contacted Martin Loewis, perhaps he'll be able to provide more info.


Re: downloading files for ftp

2003-10-02 Thread Hrvoje Niksic
Payal Rathod [EMAIL PROTECTED] writes:

 On Thu, Oct 02, 2003 at 12:03:34PM +0200, Hrvoje Niksic wrote:
 Payal Rathod [EMAIL PROTECTED] writes:
 
  On Wed, Oct 01, 2003 at 09:26:47PM +0200, Hrvoje Niksic wrote:
  The way to do it with Wget would be something like:
  
  wget --mirror --no-host-directories ftp://username:[EMAIL PROTECTED]
 
  But if I run in thru' crontab, where will it store the downloaded files?
  I want it to store as it is in server 1.
 
 It will store them to the current directory.  You can either cd to the
 desired target directory, or use the `-P' flag to specify the
 directory to Wget.

 Thanks a lot. It works wonderfully. But one small thing here. I am
 trying to use it thru' cron like this,

 51 * * * * wget --mirror --no-host-directories -P /home/t1 ftp://root:[EMAIL 
 PROTECTED]//home/payal/qmail*

 But instead of delivering it to /home/t1, wget makes a directory
 /home/t1/home/payal and put the qmail* files there.

 What is the workaround for this?

Use `--cut-dirs=2', which will tell Wget to get rid of two levels of
directory hierarchy (home and payal).


Re: run_with_timeout() for Windows

2003-10-02 Thread Hrvoje Niksic
Gisle Vanem [EMAIL PROTECTED] writes:

 I've patched util.c to make run_with_timeout() work on Windows
 (better than it does with alarm()!).

Cool, thanks!  Note that, to save the honor of Unix, I've added
support for setitimer on systems that support it (virtually everything
these days), so run_with_timeout now always works with sub-second
precision.
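
For reference, a rough sketch of how a setitimer-based sub-second
alarm can be wired up (illustrative only -- the function names here
are made up and the actual Wget code may differ):

#include <signal.h>
#include <setjmp.h>
#include <sys/time.h>

static sigjmp_buf run_with_timeout_env;

static void
abort_run_with_timeout (int sig)
{
  (void) sig;
  /* Jump back to the sigsetjmp in the caller, abandoning FUN.  */
  siglongjmp (run_with_timeout_env, 1);
}

/* Arm a one-shot real-time timer that delivers SIGALRM after SECONDS,
   which may be fractional.  */
static void
alarm_set (double seconds)
{
  struct itimerval itv;
  itv.it_interval.tv_sec = 0;
  itv.it_interval.tv_usec = 0;                  /* one-shot, no repeat */
  itv.it_value.tv_sec = (long) seconds;
  itv.it_value.tv_usec =
    (long) ((seconds - (long) seconds) * 1000000L);
  signal (SIGALRM, abort_run_with_timeout);
  setitimer (ITIMER_REAL, &itv, NULL);
}

The caller would do sigsetjmp (run_with_timeout_env, 1) before
invoking FUN and disarm the timer afterwards.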

Also, I think the Windows-specific implementation of run_with_timeout
should be entirely in mswindows.c.  The Unix one in utils.c is enough
of a soup to add the Windows version as well.  Besides, mswindows.c
can freely include all the needed headers, use MSVC++ specific
constructs, etc.

 In short, it creates and starts a thread, then loops querying the
 thread exit-code; breaks if != STILL_ACTIVE, else sleeps for 0.1
 sec. Uses a wget_timer too for added accuracy.

The 0.1s sleeps strike me as inefficient.  Couldn't you wait for a
condition instead?  For example:

run_with_timeout(...)
{
  initialize condvar  (pthread_cond_init)
  spawn the thread
  wait on condvar's condition with specified timeout  (pthread_cond_timedwait)
  kill the thread or not, depending on whether the above wait timed
out or not.
}

thread_helper()
{
  call fun(arg)
  signal the condvar  (pthread_cond_signal)
}
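
For illustration, here is roughly what that could look like with POSIX
threads -- a sketch of the suggestion above, not Wget's actual Windows
code; error handling is simplified and the context structure is an
assumption:

#include <errno.h>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>

struct rwt_ctx {
  void (*fun) (void *);
  void *arg;
  pthread_mutex_t lock;
  pthread_cond_t done_cv;
  int done;
};

static void *
thread_helper (void *p)
{
  struct rwt_ctx *ctx = p;
  ctx->fun (ctx->arg);                    /* do the real work */
  pthread_mutex_lock (&ctx->lock);
  ctx->done = 1;
  pthread_cond_signal (&ctx->done_cv);    /* wake up the waiter */
  pthread_mutex_unlock (&ctx->lock);
  return NULL;
}

/* Run FUN(ARG); return non-zero if it did not finish within SECONDS.  */
int
run_with_timeout (double seconds, void (*fun) (void *), void *arg)
{
  struct rwt_ctx ctx;
  pthread_t thr;
  struct timeval now;
  struct timespec deadline;
  int timed_out = 0;

  ctx.fun = fun;
  ctx.arg = arg;
  ctx.done = 0;
  pthread_mutex_init (&ctx.lock, NULL);
  pthread_cond_init (&ctx.done_cv, NULL);

  /* pthread_cond_timedwait wants an absolute deadline.  */
  gettimeofday (&now, NULL);
  deadline.tv_sec  = now.tv_sec + (time_t) seconds;
  deadline.tv_nsec = now.tv_usec * 1000
                     + (long) ((seconds - (time_t) seconds) * 1e9);
  if (deadline.tv_nsec >= 1000000000L)
    {
      deadline.tv_sec++;
      deadline.tv_nsec -= 1000000000L;
    }

  pthread_create (&thr, NULL, thread_helper, &ctx);

  pthread_mutex_lock (&ctx.lock);
  while (!ctx.done && !timed_out)
    if (pthread_cond_timedwait (&ctx.done_cv, &ctx.lock, &deadline)
        == ETIMEDOUT)
      timed_out = 1;
  pthread_mutex_unlock (&ctx.lock);

  if (timed_out)
    pthread_cancel (thr);   /* give up on the worker; FUN must be cancelable */
  pthread_join (thr, NULL);

  pthread_mutex_destroy (&ctx.lock);
  pthread_cond_destroy (&ctx.done_cv);
  return timed_out;
}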

 I have a problem with run_with_timeout() returning 1 and hence
 lookup_host() reporting ETIMEDOUT. Isn't TRY_AGAIN more suited to
 indicating the caller should try a longer timeout?

I'm not sure what you mean here.  Isn't the whole point of having a
DNS timeout for the program to *not* retry with a longer value, but to
give up?

Or, do you mean that Wget's *_loop functions should treat host lookup
failure due to timeout as non-fatal error?

 +  if (seconds < 1.0)
 +    seconds = 1.0;

Why is this necessary?  The alarm() code was doing something similar,
but that was to make sure a 0.5s timeout doesn't end up calling
alarm(0), which would mean wait forever.

BTW why are you setting the stack size to 4096 (bytes?)?  It probably
doesn't matter in the current implementation, but it might hurt other
uses of run_with_timeout.

 +  /* If we timed out kill the thread. Normal thread exitCode would be 0.
 +   */
 +  if (exitCode == STILL_ACTIVE)
 +  {
 +DEBUGN (2, ("thread timed out\n"));
 +exitCode = 1;
 +TerminateThread (thread_hnd, exitCode);
 +WSASetLastError (ETIMEDOUT); /* overridden by caller */

Why are you setting the error here?  The semantics of run_with_timeout
are supposed to be that error conditions are determined by whatever
FUN was doing.  If some X_with_timeout routine wants to set errno to
ETIMEDOUT, it can, but it's not run_with_timeout's job to do that.



Re: run_with_timeout() for Windows

2003-10-02 Thread Hrvoje Niksic
I've committed this patch, with minor changes, such as moving the code
to mswindows.c.  Since I don't have MSVC, someone else will need to
check that the code compiles.  Please let me know how it goes.



Re: wget 1.9 - behaviour change in recursive downloads

2003-10-03 Thread Hrvoje Niksic
It's a feature.  `-A zip' means `-A zip', not `-A zip,html'.  Wget
downloads the HTML files only because it absolutely has to, in order
to recurse through them.  After it finds the links in them, it deletes
them.


Re: some wget patches against beta3

2003-10-03 Thread Hrvoje Niksic
Thanks for the contribution.  Note that a slightly more correct place
to send the patch is the [EMAIL PROTECTED] list, followed by
people with a keener interest in development.

Also, you should send at least a short explanation of what each patch
is supposed to do and why one should apply it.  (Except in the case of
really short, self-explanatory patches, of course.)

As for the Polish translation, translations are normally handled
through the Translation Project.  The TP robot is currently down, but
I assume it will be back up soon, and then we'll submit the POT file
and update the translations /en masse/.


Re: mswindows.h patch

2003-10-03 Thread Hrvoje Niksic
Thanks for the patch, I've now applied it with the following ChangeLog
entry:

2003-10-03  Gisle Vanem  [EMAIL PROTECTED]

* connect.c: And don't include them here.

* mswindows.h: Include winsock headers here.

However, I've postponed applying the part that changes `-d'.  I agree
that `-d' could stand improvement, but let's wait with that until 1.9
is released.


Re: wget 1.9 - behaviour change in recursive downloads

2003-10-03 Thread Hrvoje Niksic
Jochen Roderburg [EMAIL PROTECTED] writes:

 Zitat von Hrvoje Niksic [EMAIL PROTECTED]:

 It's a feature.  `-A zip' means `-A zip', not `-A zip,html'.  Wget
 downloads the HTML files only because it absolutely has to, in order
 to recurse through them.  After it finds the links in them, it deletes
 them.

 Hmm, so it has really been an undetected error over all the years
 ;-) ?

s/undetected/unfixed/

At least I've always considered it an error.  I didn't know people
depended on it.



Re: run_with_timeout() for Windows

2003-10-04 Thread Hrvoje Niksic
Gisle Vanem [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] said:

 I've committed this patch, with minor changes, such as moving the code
 to mswindows.c.  Since I don't have MSVC, someone else will need to
 check that the code compiles.  Please let me know how it goes.

 It compiled with MSVC okay, but crashed somewhere unrelated.
 Both before and after my patch.

In which code does it crash?  Is the crash repeatable?  If so, how do
you repeat it?

Can you see if the same crash occurs in beta1 or beta2 codebase?

Thanks.



Re: Bug in Windows binary?

2003-10-05 Thread Hrvoje Niksic
Gisle Vanem [EMAIL PROTECTED] writes:

 --- mswindows.c.org Mon Sep 29 11:46:06 2003
 +++ mswindows.c Sun Oct 05 17:34:48 2003
 @@ -306,7 +306,7 @@
  DWORD set_sleep_mode (DWORD mode)
  {
HMODULE mod = LoadLibrary (kernel32.dll);
 -  DWORD (*_SetThreadExecutionState) (DWORD) = NULL;
 +  DWORD (WINAPI *_SetThreadExecutionState) (DWORD) = NULL;
DWORD rc = (DWORD)-1;

 I assume Heiko didn't notice it because he doesn't have that
 function in his kernel32.dll. Heiko and Hrvoje, will you correct
 this ASAP?

I've now applied the patch, thanks.  I use the following ChangeLog
entry:

2003-10-05  Gisle Vanem  [EMAIL PROTECTED]

* mswindows.c (set_sleep_mode): Fix type of
_SetThreadExecutionState.



Re: subscribe wget

2003-10-06 Thread Hrvoje Niksic
To subscribe to this list, please send mail to
[EMAIL PROTECTED].


Re: can wget disable HTTP Location Forward ?

2003-10-06 Thread Hrvoje Niksic
There is currently no way to disable following redirects.  A patch to
do so has been submitted recently, but I didn't see a good reason why
one would need it, so I didn't add the option.  Your mail is a good
argument, but I don't know how prevalent that behavior is.

What is it with servers that can't be bothered to return 404?  Are
there lots of them nowadays?  Is a new default setting of Apache or
IIS to blame, or are people intentionally screwing up their
configurations?


Re: Web page source using wget?

2003-10-06 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 wget
 http://www.custsite.com/some/page.html --http-user=USER --http-passwd=PASS

 If you supply your user ID and password via a web form, it will be
 tricky (if not impossible) because wget doesn't POST forms (unless
 someone added that option while I wasn't looking. :-)

Wget 1.9 can send POST data.

But there's a simpler way to handle web sites that use cookies for
authorization: make Wget use the site's own cookie.  Export cookies as
explained in the manual, and specify:

wget --load-cookies=COOKIE-FILE http://...

Here is an excerpt from the manual section that explains how to export
cookies.

`--load-cookies FILE'
 Load cookies from FILE before the first HTTP retrieval.  FILE is a
 textual file in the format originally used by Netscape's
 `cookies.txt' file.

 You will typically use this option when mirroring sites that
 require that you be logged in to access some or all of their
 content.  The login process typically works by the web server
 issuing an HTTP cookie upon receiving and verifying your
 credentials.  The cookie is then resent by the browser when
 accessing that part of the site, and so proves your identity.

 Mirroring such a site requires Wget to send the same cookies your
 browser sends when communicating with the site.  This is achieved
 by `--load-cookies'--simply point Wget to the location of the
 `cookies.txt' file, and it will send the same cookies your browser
 would send in the same situation.  Different browsers keep textual
 cookie files in different locations:

Netscape 4.x.
  The cookies are in `~/.netscape/cookies.txt'.

Mozilla and Netscape 6.x.
  Mozilla's cookie file is also named `cookies.txt', located
  somewhere under `~/.mozilla', in the directory of your
  profile.  The full path usually ends up looking somewhat like
  `~/.mozilla/default/SOME-WEIRD-STRING/cookies.txt'.

Internet Explorer.
  You can produce a cookie file Wget can use by using the File
  menu, Import and Export, Export Cookies.  This has been
  tested with Internet Explorer 5; it is not guaranteed to work
  with earlier versions.

Other browsers.
  If you are using a different browser to create your cookies,
  `--load-cookies' will only work if you can locate or produce a
  cookie file in the Netscape format that Wget expects.

 If you cannot use `--load-cookies', there might still be an
 alternative.  If your browser supports a cookie manager, you can
 use it to view the cookies used when accessing the site you're
 mirroring.  Write down the name and value of the cookie, and
 manually instruct Wget to send those cookies, bypassing the
 official cookie support:

  wget --cookies=off --header "Cookie: NAME=VALUE"




Re: Web page source using wget?

2003-10-06 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 Hello Everyone,

 I am new to this wget utility, so pardon my ignorance.. Here is a
 brief explanation of what I am currently doing:

 1). I go to our customer's website every day & log in using a User Name & Password.
 2). I click on 3 links before I get to the page I want.
 3). I right-click on the page & choose view source. It opens it up in Notepad.
 4). I save the source to a file & subsequently perform various tasks on that file.

 As you can see, it is a manual process. What I would like to do is
 automate this process of obtaining the source of a page using
 wget. Is this possible? Maybe you can give me some suggestions.

It's possible, in fact it's what Wget does in its most basic form.
Disregarding authentication, the recipe would be:

1) Write down the URL.

2) Type `wget URL' and you get the source of the page in file named
   SOMETHING.html, where SOMETHING is the file name that the URL ends
   with.

Of course, you will also have to specify the credentials to the page,
and Tony explained how to do that.



Wget 1.9-beta4 is available for testing

2003-10-06 Thread Hrvoje Niksic
Several bugs fixed since beta3, including a fatal one on Windows.
Includes a working Windows implementation of run_with_timeout.

Get it from:

http://fly.srk.fer.hr/~hniksic/wget/wget-1.9-beta4.tar.gz



Re: -q and -S are incompatible

2003-10-07 Thread Hrvoje Niksic
Dan Jacobson [EMAIL PROTECTED] writes:

 -q and -S are incompatible and should perhaps produce errors and be
 noted thus in the docs.

They seem to work as I'd expect -- `-q' tells Wget to print *nothing*,
and that's what happens.  The output Wget would have generated does
contain HTTP headers, among other things, but it never gets printed.

 BTW, there seems no way to get the -S output, but no progress
 indicator.  -nv, -q kill them both.

It's a bug that `-nv' kills `-S' output, I think.

 P.S. one shouldn't have to confirm each bug submission. Once should
 be enough.

You're right.  :-(  I'll ask the sunsite people if there's a way to
establish some form of white lists...



Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

 As for the Polish translation, translations are normally handled
 through the Translation Project.  The TP robot is currently down, but
 I assume it will be back up soon, and then we'll submit the POT file
 and update the translations /en masse/.

 It took a little bit longer than expected but now, the robot is up and
 running again.  This morning (CET) I installed b3 for translation.

However, http://www2.iro.umontreal.ca/~gnutra/registry.cgi?domain=wget
still shows `wget-1.8.2.pot' to be the current template for [the]
domain.  Also, my Croatian translation of 1.9 doesn't seem to have
made it in.  Is that expected?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Also, my Croatian translation of 1.9 doesn't seem to have made it
 in.  Is that expected?

 Unfortunately, yes.  Will you please resubmit it with the subject line
 updated (IIRC, it's now):

 TP-Robot wget-1.9-b3.hr.po

I'm not sure what b3 is, but the version in the POT file was
supposed to be beta3.  Was there a misunderstanding somewhere along
the line?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

 I'm not sure what b3 is, but the version in the POT file was
 supposed to be beta3.  Was there a misunderstanding somewhere along
 the line?

 Yes, the robot does not like beta3 as part of the version
 string. b3 or pre3 are okay.

Ouch.  Why does the robot care about version names at all?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

 Ouch.  Why does the robot care about version names at all?

 It must know about the sequences; this is important for merging
 issues.  IIRC, we have at least these sequences supported by the
 robot:

 1.2 -> 1.2.1 -> 1.2.2 -> 1.3 etc.

 1.2 -> 1.2a -> 1.2b -> 1.3

 1.2 -> 1.3-pre1 -> 1.3-pre2 -> 1.3

 1.2 -> 1.3-b1 -> 1.3-b2 -> 1.3

Thanks for the clarification, Karl.  But as a maintainer of a project
that tries to use the robot, I must say that I'm not happy about this.

If the robot absolutely must be able to collate versions, then it
should be smarter about it and support a larger array of formats in
use out there.  See `dpkg' for an example of how it can be done,
although the TP robot certainly doesn't need to do all that `dpkg'
does.

This way, unless I'm missing something, the robot seems to be in the
position to dictate its very narrow-minded versioning scheme to the
projects that would only like to use it (the robot).  That's really
bad.  But what's even worse is that something or someone silently
changed beta3 to b3 in the POT, and then failed to perform the
same change for my translation, which caused it to get dropped without
notice.  Returning an error that says your version number is
unparsable to this piece of software, you must use one of ...
instead would be more correct in the long run.

Is the robot written in Python?  Would you consider it for inclusion
if I donated a function that performed the comparison more fully
(provided, of course, that the code meets your standards of quality)?


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:

 Please be aware that Wget needs to know the size of the POST
 data in advance.  Therefore the argument to @code{--post-file}
 must be a regular file; specifying a FIFO or something like
 @file{/dev/stdin} won't work.

 There's nothing that says you have to read the data after you've
 started sending the POST. Why not just read the --post-file before
 constructing the request so that you know how big it is?

I don't understand what you're proposing.  Reading the whole file in
memory is too memory-intensive for large files (one could presumably
POST really huge files, CD images or whatever).

What the current code does is: determine the file size, send
Content-Length, read the file in chunks (up to the promised size) and
send those chunks to the server.  But that works only with regular
files.  It would be really nice to be able to say something like:

mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin
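
For illustration, the regular-file flow described above looks roughly
like this (a sketch with made-up names, not the actual http.c code):

#include <stdio.h>
#include <unistd.h>

/* Sketch: after Content-Length has announced PROMISED_SIZE, stream the
   body of FILE to SOCK in fixed-size chunks.  A real implementation
   must also handle partial and interrupted writes.  */
static int
send_body_from_file (int sock, const char *file, long promised_size)
{
  char chunk[8192];
  long remaining = promised_size;
  FILE *fp = fopen (file, "rb");

  if (!fp)
    return -1;
  while (remaining > 0)
    {
      size_t to_read = remaining < (long) sizeof chunk
                       ? (size_t) remaining : sizeof chunk;
      size_t nread = fread (chunk, 1, to_read, fp);
      if (nread == 0)
        break;                        /* file shrank or read error */
      if (write (sock, chunk, nread) < 0)
        {
          fclose (fp);
          return -1;
        }
      remaining -= (long) nread;
    }
  fclose (fp);
  return remaining == 0 ? 0 : -1;
}

The pipe case fails at the very first step -- there is no way to learn
PROMISED_SIZE up front -- which is exactly the limitation being
discussed.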

 My first impulse was to bemoan Wget's antiquated HTTP code which
 doesn't understand chunked transfer.  But, coming to think of it,
 even if Wget used HTTP/1.1, I don't see how a client can send
 chunked requests and interoperate with HTTP/1.0 servers.

 How do browsers figure out whether they can do a chunked transfer or
 not?

I haven't checked, but I'm 99% convinced that browsers simply don't
give a shit about non-regular files.


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 I guess, you as the wget maintainer switched from something
 supported to the unsupported betaX scheme and now we have
 something to talk about ;)

I had no idea that something as usual as betaX was unsupported.  In
fact, I believe that bX was added when Francois saw me using it in
Wget.  :-)

 Using something different than exactly wget-1.9-b3.de.po will
 confuse the robot

sigh

 Returning an error that says your version number is unparsable to
 this piece of software, you must use one of ... instead would be
 more correct in the long run.

 Sure.  You should have received a message like this, didn't you?

I didn't.  Maybe it was an artifact of robot not having worked at the
time, though.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Stefan Eissing [EMAIL PROTECTED] writes:

 On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje
 Niksic wrote:
 What the current code does is: determine the file size, send
 Content-Length, read the file in chunks (up to the promised size) and
 send those chunks to the server.  But that works only with regular
 files.  It would be really nice to be able to say something like:

 mkisofs blabla | wget http://burner/localburn.cgi --post-file
 /dev/stdin

 That would indeed be nice. Since I'm coming from the WebDAV side
 of life: does wget allow the use of PUT?

No.

 I haven't checked, but I'm 99% convinced that browsers simply don't
 give a shit about non-regular files.

 That's probably true. But have you tried sending without
 Content-Length and Connection: close and closing the output side of
 the socket before starting to read the reply from the server?

That might work, but it sounds too dangerous to do by default, and too
obscure to devote a command-line option to.  Besides, HTTP/1.1
*requires* requests with a request-body to provide Content-Length:

   For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
   containing a message-body MUST include a valid Content-Length
   header field unless the server is known to be HTTP/1.1 compliant.


Re: [PATCH] wget-1.8.2: Portability, plus EBCDIC patch

2003-10-07 Thread Hrvoje Niksic
Martin, thanks for the patch and the detailed report.  Note that it
might have made more sense to apply the patch to the latest CVS
version, which is somewhat different from 1.8.2.

I'm really not sure whether to add this patch.  On the one hand, it's
nice to support as many architectures as possible.  But on the other
hand, most systems are ASCII.  All the systems I've ever seen or
worked on have been ASCII.  I am fairly certain that I would not be
able to support EBCDIC in the long run and that, unless someone were
to continually support EBCDIC, the existing support would bitrot away.

Is anyone on the Wget list using an EBCDIC system?


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:

 I don't understand what you're proposing.  Reading the whole file in
 memory is too memory-intensive for large files (one could presumably
 POST really huge files, CD images or whatever).

 I was proposing that you read the file to determine the length, but
 that was on the assumption that you could read the input twice,
 which won't work with the example you proposed.

In fact, it won't work with anything except regular files and links to
them.

 Can you determine if --post-file is a regular file?

Yes.

 If so, I still think you should just read (or otherwise examine) the
 file to determine the length.

That's how --post-file works now.  The problem is that it doesn't work
for non-regular files.  My first message explains it, or at least
tries to.

 For other types of input, perhaps you want write the input to a
 temporary file.

That would work for short streaming, but would be pretty bad in the
mkisofs example.  One would expect Wget to be able to stream the data
to the server, and that's just not possible if the size needs to be
known in advance, which HTTP/1.0 requires.


Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Hrvoje Niksic
Josh Brooks [EMAIL PROTECTED] writes:

 I have noticed very unpredictable behavior from wget 1.8.2 -
 specifically I have noticed two things:

 a) sometimes it does not follow all of the links it should

 b) sometimes wget will follow links to other sites and URLs - when the
 command line used should not allow it to do that.

Thanks for the report.  A more detailed response follows below:

 First, sometimes when you attempt to download a site with -k -m
 (--convert-links and --mirror) wget will not follow all of the links and
 will skip some of the files!

 I have no idea why it does this with some sites and doesn't do it with
 other sites.  Here is an example that I have reproduced on several systems
 - all with 1.8.2:

Links are missed on some sites because of the use of incorrect
comments.  This has been fixed for Wget 1.9, where a more relaxed
comment parsing code is the default.  But that's not the case for
www.zorg.org/vsound/.

www.zorg.org/vsound/ contains this markup:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

That explicitly tells robots, such as Wget, not to follow the links in
the page.  Wget respects this and does not follow the links.  You can
tell Wget to ignore the robot directives.  For me, this works as
expected:

wget -km -e robots=off http://www.zorg.org/vsound/

You can put `robots=off' in your .wgetrc and this problem will not
bother you again.

 The second problem, and I cannot currently give you an example to try
 yourself but _it does happen_, is if you use this command line:

 wget --tries=inf -nH --no-parent
 --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf
 --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible;
 MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

 At first it will act normally, just going over the site in question, but
 sometimes, you will come back to the terminal and see it grabbing all
 sorts of pages from totally different sites (!)

The only way I've seen it happen is when it follows a redirection to a
different site.  The redirection is followed because it's considered
to be part of the same download.  However, further links on the
redirected site are not (supposed to be) followed.

If you have a repeatable example, please mail it here so we can
examine it in more detail.


Re: Web page source using wget?

2003-10-07 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 Thanks everyone for the replies so far..

 The problem I am having is that the customer is using ASP & JavaScript.
 The URL stays the same as I click through the links.

URL staying the same is usually a sign of the use of frames, not of ASP
and JavaScript.  Instead of looking at the URL entry field, try using
"copy link to clipboard" instead of clicking on the last link.  Then
use Wget on that.



Re: Web page source using wget?

2003-10-07 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 this page contains a drop-down list of our customer's locations.
 At present, I choose one location from the drop-down list & click
 submit to get the data, which is displayed in a report format. I
 right-click & then choose view source & save source to a file.
 I then choose the next location from the drop-down list, click
 submit again. I again do a view source & save the source to
 another file and so on for all their locations.

It's possible to automate this, but it requires some knowledge of
HTML.  Basically, you need to look at the <form>...</form> part of the
page and find the <select> tag that defines the drop-down.  Assuming
that the form looks like this:

<form action="http://foo.com/customer" method="GET">
  <select name="location">
    <option value="ca">California
    <option value="ma">Massachusetts
    ...
  </select>
</form>

you'd automate getting the locations by doing something like:

for loc in ca ma ...
do
  wget "http://foo.com/customer?location=$loc"
done

Wget will save the respective sources in files named
customer?location=ca, customer?location=ma, etc.

But this was only an example.  The actual process depends on what's in
the form, and it might be considerably more complex than this.



Re: Web page source using wget?

2003-10-07 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 It does look a little complicated. This is how it looks:

 <form action="InventoryStatus.asp" method="post"> [...]
[...]
 <select name="cboSupplier">
 <option value="4541-134289">454A</option>
 <option value="4542-134289" selected>454B</option>
 </select>

Those are the important parts.  It's not hard to submit this form.
With Wget 1.9, you can even use the POST method, e.g.:

wget http://.../InventoryStatus.asp --post-data \
 'cboSupplier=4541-134289&status=all&action-select=Query' \
 -O InventoryStatus1.asp
wget http://.../InventoryStatus.asp --post-data \
 'cboSupplier=4542-134289&status=all&action-select=Query' \
 -O InventoryStatus2.asp

It might even work to simply use GET, and retrieve
http://.../InventoryStatus.asp?cboSupplier=4541-134289&status=all&action-select=Query
without the need for `--post-data' or `-O', but that depends on the
ASP script that does the processing.

The harder part is to automate this process for *any* values in the
drop-down list.  You might need to use an intermediary Perl script
that extracts all the <option value=...> tags from the HTML source of the
page with the drop-down.  Then, from the output of the Perl script,
you call Wget as shown above.

It's doable, but it takes some work.  Unfortunately, I don't know of a
(command-line) tool that would make this easier.



Re: some wget patches against beta3

2003-10-08 Thread Hrvoje Niksic
[EMAIL PROTECTED] (Martin v. Löwis) writes:

 Why do you think the scheme is narrow-minded?

Because 1.9-beta3 seems to be a problem.

 VERSION = ('[.0-9]+-?b[0-9]+'
'|[.0-9]+-?dev[0-9]+'
'|[.0-9]+-?pre[0-9]+'
'|[.0-9]+-?rel[0-9]+'
'|[.0-9]+[a-z]?'
'|[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')

But that's narrow.  Why support 1.9-b3, but not 1.9-beta3 or
1.9-alpha3, or 1.9-rc10?  Those and similar version schemes are in
wide use.

 That's really bad.  But what's even worse is that something or
 someone silently changed beta3 to b3 in the POT, and then failed
 to perform the same change for my translation, which caused it to
 get dropped without notice.

 Nothing should get dropped without a notice. [...]

I now understand that this could have been an exception due to the
outage.  But that's how it happened.  I sent the translation -- twice
-- and it got dropped.  Karl told me to resend the translation with a
1.9-b3 version (which I'd never heard of before), so I naturally
assumed that the submission had been dropped because of version.

 Now, since UMontreal has changed the translation@ alias, it might be
 that some messages were lost during the outage; this is unfortunate,
 but difficult to correct, as we cannot find out which messages might
 have lost. Fortunately, most translators know to get a message back
 from the robot for all submissions, so if they don't get one, they
 resend.

Note that I did resend, but to no avail.  My first attempt contained a
MIME attachment, which I then found out the robot didn't understand.
My second attempt was from po-mode, which should have produced a valid
message, except for the version.



Re: wget ipv6 patch

2003-10-08 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 so, i am asking you: what do you think of these changes?

Overall they look very good!  Judging from the patch, a large part of
the work seems to be in an unexpected place: the FTP code.

Here are some remarks I got looking at the patch.

It inadvertently undoes the latest fnmatch move.

I still don't understand the choice to use sockaddr and
sockaddr_storage in application code.  They result in needless casts
and (to me) incomprehensible code.  For example, this cast:
(unsigned char *)(addr->addr_v4.s_addr) would not be necessary if the
address were defined as unsigned char[4].

I don't understand the new PASSIVE flag to lookup_host.

In lookup_host, the comment says that you don't need to call
getaddrinfo_with_timeout, but then you call getaddrinfo_with_timeout.
An oversight?

You removed this code:

-  /* ADDR is defined to be in network byte order, which is what
-this returns, so we can just copy it to STORE_IP.  However,
-on big endian 64-bit architectures the value will be stored
-in the *last*, not first four bytes.  OFFSET makes sure that
-we copy the correct four bytes.  */
-  int offset = 0;
-#ifdef WORDS_BIGENDIAN
-  offset = sizeof (unsigned long) - sizeof (ip4_address);
-#endif

But the reason the code is there is that inet_aton is not present on
all architectures, whereas inet_addr is.  So I used only inet_addr in
the IPv4 case, and inet_addr stupidly returned `long', which requires
some contortions to copy into a uchar[4] on 64-bit machines.  (I see
that inet_addr returns `in_addr_t' these days.)

If you intend to use inet_aton without checking, there should be a
fallback implementation in cmpt.c.

I note that you elided TYPE from ip_address if ENABLE_IPV6 is not
defined.  That (I think) results in code duplication in some places,
because the code effectively has to handle the IPv4 case twice:

#ifdef ENABLE_IPV6
switch (addr->type)
  {
case IPv6:
... IPv6 handling ...
break;
case IPv4:
... IPv4 handling ...
break;
  }
#else
  ... IPv4 handling because TYPE is not present without ENABLE_IPV6 ...
#endif

If it would make your life easier to add TYPE in !ENABLE_IPV6 case, so
you can write it more compactly, by all means do it.  By more
compactly I mean something code like this:

switch (addr->type)
  {
#ifdef ENABLE_IPV6
case IPv6:
... IPv6 handling ...
break;
#endif
case IPv4:
... IPv4 handling ...
break;
  }



Re: wget ipv6 patch

2003-10-08 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 I still don't understand the choice to use sockaddr and
 sockaddr_storage in application code.
 They result in needless casts and (to me) incomprehensible code.

 well, using sockaddr_storage is the right way (TM) to write IPv6 enabled
 code ;-)

Not when the only thing you need is storing the result of a DNS
lookup.

I've seen the RFC, but I don't agree with it in the case of Wget.  In
fact, even the RFC states that the data structure is merely a help for
writing portable code across multiple address families and
platforms.  Wget doesn't aim for AF independence, and the
alternatives are at least as good for platform independence.

 For example, this cast: (unsigned char *)(addr->addr_v4.s_addr)
 would not be necessary if the address were defined as unsigned
 char[4].

 in_addr is the correct structure to store ipv4 addresses. using
 in_addr instead of unsigned char[4] makes much easier to copy or
 compare ipv4 addresses. moreover, you don't have to care about the
 integer size in 64-bits architectures.

An IPv4 address is nothing more than a 32-bit quantity.  I don't see
anything incorrect about using unsigned char[4] for that, and that
works perfectly fine on 64-bit architectures.

Besides, you seem to be willing to cache the string representation of
an IP address.  Why is it acceptable to work with a char *, but
unacceptable to work with unsigned char[4]?  I simply don't see that
in_addr is helping anything in host.c's code base.

 I don't understand the new PASSIVE flag to lookup_host.

 well, that's a problem. to get a socket address suitable for
 bind(2), you must call getaddrinfo with the AI_PASSIVE flag set.

Why?  The current code seems to get by without it.

There must be a way to get at the socket address without calling
getaddrinfo.

 are there __REALLY__ systems which do not support inet_aton? their
 ISVs should be ashamed of themselves...

Those systems are very old, possibly predating the very invention of
inet_aton.

 If it would make your life easier to add TYPE in !ENABLE_IPV6 case,
 so you can write it more compactly, by all means do it.  By more
 compactly I mean something code like this:

[...]
 that's a question i was going to ask you. i supposed you were
 against adding the type member to ip_address in the IPv4-only case,

Maintainability is more important than saving a few bytes per cached
IP address, especially since I don't expect the number of cache
entries to ever be large enough to make a difference.  (If someone
downloads from so many addresses that the hash table sizes become a
problem, the TYPE member will be the least of his problems.)

 P.S. please notice that by caching the string representation of IP
  addresses instead of their network representation, the code
  could become much more elegant and simple.

You said that before, but I don't quite understand why that's the
case.  It's certainly not the case for IPv4.



Re: some wget patches against beta3

2003-10-08 Thread Hrvoje Niksic
[EMAIL PROTECTED] (Martin v. Löwis) writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

  VERSION = ('[.0-9]+-?b[0-9]+'
 '|[.0-9]+-?dev[0-9]+'
 '|[.0-9]+-?pre[0-9]+'
 '|[.0-9]+-?rel[0-9]+'
 '|[.0-9]+[a-z]?'
 '|[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
 
 But that's narrow.  Why support 1.9-b3, but not 1.9-beta3 or
 1.9-alpha3, or 1.9-rc10?  Those and similar version schemes are in
 wide use.

 Are you requesting the addition of these three formats?

Yes, please.

To be clear: it would be ideal if the Robot didn't care about
versioning at all.  But if it really has to, then it should support
versioning schemes in wide use.


Re: windows patch for cvs

2003-10-09 Thread Hrvoje Niksic
Thanks for the patch, Herold.  I've applied and also added similar
fixes for Borland's and Watcom's Makefiles.  I've used the following
ChangeLog entry:

2003-10-09  Herold Heiko  [EMAIL PROTECTED]

   * windows/Makefile.watcom (OBJS): Ditto.

   * windows/Makefile.src.bor: Ditto.

   * windows/wget.dep: Ditto.

   * windows/Makefile.src: Removed references to fnmatch.c and
   fnmatch.o.



Re: wget checks timestamp on wrong file

2003-10-09 Thread Hrvoje Niksic
It's a bug.  -O currently doesn't work everywhere in should.  If you
just want to change the directory where Wget operates, the workaround
is to use `-P'.  E.g.:

wget -N ftp://ftp.pld-linux.org/dists/ac/PLD/athlon/PLD/RPMS/packages.dir.mdd -P /root/tmp/ftp_ftp.pld-linux.org.dists.ac.PLD.athlon.PLD.RPMS



Re: wget ipv6 patch

2003-10-10 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 and i'm saying that for this task the ideal structure is
 sockaddr_storage. notice that my code uses sockaddr_storage
 (typedef'd as wget_sockaddr) only when dealing with socket
 addresses, not for ip address caching.

Now I see.  Thanks for clearing it up.

 An IPv4 address is nothing more than a 32-bit quantity.  I don't
 see anything incorrect about using unsigned char[4] for that, and
 that works perfectly fine on 64-bit architectures.

 ok, but in this way you have to call memcmp each time you want to compare
 two ip addresses and memcpy each time you want to copy an ip
 address.

Well, you can copy addresses with the assignment operator as well, as
long as they're in a `struct', as they are in the current code.  You
do need `memcmp' to compare them, but that's fine with me.

 i prefer the in_addr approach (and i don't understand why we
 shouldn't use structures like in_addr and in_addr6 which have been
 created just to do what we want: storing ip addresses)

Because they're complexly defined and hard to read if all you want is
to store 4 and 16 bytes of binary data, respectively.

 however, notice that using unsigned char[4] and unsigned char[16] is
 a less portable solution and is potentially prone to problems with
 the alignment of the sockaddr_in and sockaddr_in6 structs.

Note that I only propose using unsigned char[N] for internal storing
of addresses, such as in Wget's own `struct ip_address'.  For talking
to system API's, we can and should copy the address to the appropriate
sockaddr_* structure.  That's how the current code works, and it's
quite portable.
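
A small sketch of that split (struct layout and function name are
illustrative assumptions, not the actual host.c definitions):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Internal storage: just the raw bytes, in network byte order.  */
struct ip_address {
  enum { IPV4_ADDRESS, IPV6_ADDRESS } type;
  unsigned char bytes[16];      /* 4 bytes used for IPv4, 16 for IPv6 */
};

/* When talking to the system API (connect, bind, ...), copy the bytes
   into the appropriate sockaddr_* structure.  */
static void
ip_to_sockaddr_in (const struct ip_address *ip, unsigned short port,
                   struct sockaddr_in *sin)
{
  memset (sin, 0, sizeof *sin);
  sin->sin_family = AF_INET;
  sin->sin_port = htons (port);
  memcpy (&sin->sin_addr, ip->bytes, 4);   /* already network byte order */
}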

 Besides, you seem to be willing to cache the string representation
 of an IP address.  Why is it acceptable to work with a char *, but
 unacceptable to work with unsigned char[4]?  I simply don't see
 that in_addr is helping anything in host.c's code base.

 i would prefer to cache string representation of ip addresses
 because the ipv6 code would be much simpler and more elegant.

I agree.  My point was merely to point out that even you yourself
believe that struct in_addr* is not the only legitimate way to store
an IP address.

  I don't understand the new PASSIVE flag to lookup_host.
 
  well, that's a problem. to get a socket address suitable for
  bind(2), you must call getaddrinfo with the AI_PASSIVE flag set.

 Why?  The current code seems to get by without it.

 the problem is when you call lookup_host to get a struct to pass to
 bind(2). if you use --bind-address=localhost and you don't set the
 AI_PASSIVE flag, getaddrinfo will return the 127.0.0.1 address, which
 is incorrect.

 There must be a way to get at the socket address without calling
 getaddrinfo.

 not if you want to to use --bind-address=ipv6only.domain.com.

I see.  I guess we'll have to live with it, one way or the other.
Instead of accumulating boolean arguments, lookup_host should probably
accept a FLAGS argument, so you can call it with, e.g.:

lst = lookup_host (addr, LH_PASSIVE | LH_SILENT);
...
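
As a sketch, the flags could simply be bit values (the numeric values
and the address_list return type are assumptions for illustration):

/* Flag bits for lookup_host; LH_PASSIVE and LH_SILENT as used above.  */
#define LH_SILENT   0x1   /* suppress the "Resolving host..." message */
#define LH_PASSIVE  0x2   /* resolve for bind() (AI_PASSIVE), not connect() */

struct address_list *lookup_host (const char *host, int flags);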

