Re: Weird 302 problem with wget 1.7
John Levon [EMAIL PROTECTED] writes: Thanks very much (wouldn't it be good to refer to the clause in the RFC in the comments ?) Uh, I suppose so. But it doesn't matter that much -- someone looking for it will find it anyway. Besides, it's not clear which RFC Wget conforms to. Web standards are messy.
Re: Wget Patch for 1.8.1 witch IPv6
Thomas Lussnig [EMAIL PROTECTED] writes: 1. Now if IPv6 enabled it only fetch IPv6, IPv4 sites fail This is a problem, and part of the reason why the patch is so simple in its current form. A correct patch must modify struct address_list to hold a list of IP addresses, each of which can be either an IPv4 address or an IPv6 address. It could be something like:

    struct ip_address {
      enum { ADDR_IPV4, ADDR_IPV6 } type;
      union {
        ipv4_address ipv4;
        ipv6_address ipv6;
      } addr;
    };

with the appropriate #ifdefs for when IPv6 is not available. ipv6_address might also need to contain the scope information. (I don't know what that is, but I trust that you do. I've been told that IPv6 addresses were scoped.) The address_list_* functions should be modified to either return such a data structure, *or* (perhaps simpler) to provide a way for the caller to query which kind of address it's dealing with. Another possibility is to store struct sockaddr_in instead of the address. This was proposed by a Japanese developer, and I disliked that idea because it seemed cleaner to store and pass only the information we actually need. But perhaps this would be easier all around, I don't know. Also, you should get rid of the global variable as vaguely named as `family'. Also, for FTP we need to support the extended IPv6 commands. Your patch seems to introduce possibly non-portable functions such as inet_pton and gethostbyname2 without checking whether they exist. IPv6 support is not easy to add to an application heavily relying on IPv4, such as Wget. I wouldn't say that your patch is dirty or anything like it, but the fact is that in its current form it does not come close to the changes needed to fully support IPv6.
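The tagged-union layout sketched above could look like the following as compilable C. All names here are illustrative guesses (the typedefs, the helper, and the constructor are not Wget's actual API), and as noted, ipv6_address would also need room for scope information:

```c
#include <string.h>

/* Hypothetical sketch of the tagged-union address list entry
   discussed above.  Not Wget's actual API. */
typedef unsigned char ipv4_address[4];
typedef unsigned char ipv6_address[16];

struct ip_address {
  enum { ADDR_IPV4, ADDR_IPV6 } type;
  union {
    ipv4_address ipv4;
    ipv6_address ipv6;   /* would be #ifdef'ed away without IPv6 */
  } addr;
};

/* Callers query the kind of address instead of assuming IPv4. */
static int
address_is_ipv4 (const struct ip_address *ip)
{
  return ip->type == ADDR_IPV4;
}

/* Convenience constructor for an IPv4 entry. */
static struct ip_address
make_ipv4 (const unsigned char bytes[4])
{
  struct ip_address ip;
  ip.type = ADDR_IPV4;
  memcpy (ip.addr.ipv4, bytes, 4);
  return ip;
}
```

The point of the query helper is the second option mentioned above: the caller asks which kind of address it has, rather than receiving the raw union.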
Re: Wget Patch for 1.8.1 witch IPv6
Markus Buchhorn [EMAIL PROTECTED] writes: Reading back, that was itojun's proposal, and I suspect probably a good choice, even if it seems less clean. Itojun is one of the leading lights in IPv6 development, along with the whole WIDE group in Japan, and heavily involved in the v6 stacks for the *BSD family (Kame) and now moving into Linux (Usagi?). They're busy converting everything useful to support v6. I don't doubt that itojun is serious, but he might have different priorities. For example, he could choose to implement the easiest solution that would make it possible to patch a large number of programs in a realistic time frame. It doesn't mean that such a solution is necessarily the best one for each individual program.
Re: Wget Patch for 1.8.1 witch IPv6
Daniel Stenberg [EMAIL PROTECTED] writes: I'd suggest that you instead pass around a 'struct hostent *' on IPv4 only platforms Why? The rest of the code never needs anything from `struct hostent' except the list of addresses, and this is what my code extracts. By extension, the idea was for the IPv6 code to extract the list of addresses from the data returned by the IPv6 calls. with the appropriate #ifdefs for when IPv6 is not available. ipv6_address might also need to contain the scope information. (I don't know what that is, but I trust that you do. I've been told that IPv6 addresses were scoped.) IPv6 addresses are scoped, but that is nothing you have to care about as a mere application writer (unless you really want to, of course). If you just keep the list of addresses in the addrinfo struct and you try all of them when you connect, then it'll work transparently. `struct addrinfo' contains a `struct sockaddr', which carries the necessary scoping information (I think). The question at the time was whether I could extract only the address(es) and ignore everything else, as it was possible with IPv4. Itojun implied that scoping of addresses made this hard or impossible.
Re: Wget Patch for 1.8.1 witch IPv6
Daniel Stenberg [EMAIL PROTECTED] writes: On Tue, 15 Jan 2002, Hrvoje Niksic wrote: I'd suggest that you instead pass around a 'struct hostent *' on IPv4 only platforms Why? The rest of the code never needs anything from `struct hostent' except the list of addresses, and this is what my code extracts. Well, why extract the addresses when you can just leave them in the struct and pass a pointer to that? Because I'm caching the result of the lookup, and making a deep copy of `struct hostent' is not exactly easy. (Yes, I know libcurl does it, but the code is not exactly pretty, and I'd like to avoid doing that.) I am only suggesting this as it makes things a lot easier. No, that's fine, but I just don't see why things are any easier that way. One way or the other, the caller will want to deal with the address -- providing it through struct hostent or through an API call to `struct address_list' should not make a difference. connect()ing on machines that support getaddrinfo() should be a matter of running through the addrinfo list and performing something in this style:

    struct addrinfo *ai;
    sockfd = socket (ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    rc = connect (sockfd, ai->ai_addr, ai->ai_addrlen);

Except the port number can be different for each connection. And it won't work in IPv4 where I don't have `struct addrinfo' handy.
Re: Wget Patch for 1.8.1 witch IPv6
Daniel Stenberg [EMAIL PROTECTED] writes: On Tue, 15 Jan 2002, Hrvoje Niksic wrote: Well, why extract the addresses when you can just leave them in the struct and pass a pointer to that? Because I'm caching the result of the lookup, and making a deep copy of `struct hostent' is not exactly easy. (Yes, I know libcurl does it, but the code is not exactly pretty, and I'd like to avoid doing that.) No, the code doing that copy is not pretty. Deep-copying a struct like the hostent one can hardly be made pretty. Agreed. That, and the fact that I don't *need* other data from hostent, made me decide that I don't want to keep struct hostent around. The easiness comes with the fact that you have one pointer to the complete host info. Be it hostent for IPv4 or addrinfo for IPv6. Then the connect code can take that pointer and walk through the list of addresses and attempt to connect. Yes, but that's exactly the abstraction I've built for 1.8. That pointer is called `struct address_list'.

    struct addrinfo *ai;
    sockfd = socket (ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    rc = connect (sockfd, ai->ai_addr, ai->ai_addrlen);

Except the port number can be different for each connection. I think that's the intended beauty of this API. It sort of hides that fact. We don't even have to bother about which protocol it uses. It works the same. I don't think it's useful to hide the fact that one can connect to a different port of the same host. For example, `wget http://foo:80/' and `wget http://foo:81/' must connect to different ports, and I'd prefer to look up `foo' only once. But maybe I just don't see the beauty. :-) And it won't work in IPv4 where I don't have `struct addrinfo' handy. The getaddrinfo() could theoretically work just as well on IPv4-only machines, as it is IP version unbound. Sure, but older OSes don't have it implemented -- so I need to support the old API anyway.
I still think you can do it like this:

    #ifdef HAVE_GETADDRINFO
    typedef struct addrinfo *hostinformation;
    #else /* current system */
    typedef struct whateveryouhavetoday *hostinformation;
    #endif

That could of course work, but it'd defeat the very idea behind the struct whateverihavetoday (`struct address_list'), which is to allow the callers to use a clean API to access the underlying host information. But, I'm talking a lot more than what I have knowledge about with regard to how the wget code is designed. My discussion is generic and may not apply to wget internals. If you have time, please take a glance at Wget 1.8.1's `host.c' and `connect.c'. I believe it will make my POV much clearer.
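The "walk the addrinfo list and attempt to connect" idea discussed in this thread can be sketched as generic POSIX code. This is not Wget's connect.c -- the function name and structure are illustrative -- and note how getaddrinfo() folds the port into each sockaddr, which is exactly why caching only the bare addresses (as Wget's address_list does) requires handling the port separately:

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Sketch: connect to host:port by trying each address that
   getaddrinfo() returns, IPv4 or IPv6 alike.  Returns a connected
   socket descriptor, or -1 on failure. */
static int
connect_to_host (const char *host, const char *port)
{
  struct addrinfo hints, *head, *ai;
  int sockfd = -1;

  memset (&hints, 0, sizeof (hints));
  hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6, whichever works */
  hints.ai_socktype = SOCK_STREAM;

  if (getaddrinfo (host, port, &hints, &head) != 0)
    return -1;

  /* Try each address until one connect() succeeds. */
  for (ai = head; ai; ai = ai->ai_next)
    {
      sockfd = socket (ai->ai_family, ai->ai_socktype, ai->ai_protocol);
      if (sockfd < 0)
        continue;
      if (connect (sockfd, ai->ai_addr, ai->ai_addrlen) == 0)
        break;                      /* connected */
      close (sockfd);
      sockfd = -1;
    }
  freeaddrinfo (head);
  return sockfd;
}
```

Because the port is a per-call argument here, connecting to `foo:80' and `foo:81' means two lookups unless the caller adds its own address cache -- the very trade-off debated above.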
Re: Wget Patch for 1.8.1 witch IPv6
Thomas Lussnig [EMAIL PROTECTED] writes: Ok first we don't need this difference. I think it's not so easy as it first seems. Because IPv6 is a superset of IPv4 there is a representation of IPv4 addresses. But is it desirable to use it in preference to native IPv4 calls? I apologize if I appear anal here -- it's just that if we do add IPv6 support to Wget, I'd like it to be done right, as far as that's possible. Your patch seems to introduce possibly non-portable functions such as inet_pton and gethostbyname2 without checking whether they exist. That's correct, I only knew they are available on Linux and BSD, but this is the reason I made the ifdefs. And I think that these two calls can be checked from the Makefile, or easily replaced with more compatible ones (if they exist). I would like to use only the tried and true calls when compiling for IPv4. The ones Wget 1.8.1 uses have been chosen for maximum portability in preference over elegance.
Re: Content-dispotion: filename=foo HTTP header
Rami Lehti [EMAIL PROTECTED] writes: Wget should try to honor the Content-disposition: filename=foobar HTTP-response header. It is really a pain to try to download a file that is created by a script. Usually the server gives the Content-disposition: header. You would have to save the server response somewhere and rename manually. Multiply this by a factor of 500 and you have a problem. Good suggestion. I'll put it on the TODO list and see how hard it is to implement.
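Extracting the suggested filename from such a header could start from something like the sketch below. This is a naive illustration, not Wget's code: real Content-Disposition values can carry escapes and parameters this helper ignores, and it only handles the plain and double-quoted `filename=' forms mentioned above:

```c
#include <stdlib.h>
#include <string.h>

/* Naive sketch: pull the filename out of a Content-Disposition
   header value such as "attachment; filename=foobar".  Returns a
   malloc'ed copy, or NULL if no filename parameter is present. */
static char *
disposition_filename (const char *header)
{
  const char *p = strstr (header, "filename=");
  size_t len;
  char *name;

  if (!p)
    return NULL;
  p += strlen ("filename=");
  if (*p == '"')                    /* optionally double-quoted */
    {
      const char *q = strchr (++p, '"');
      len = q ? (size_t) (q - p) : strlen (p);
    }
  else
    len = strcspn (p, "; \t");      /* stop at separator */

  name = malloc (len + 1);
  if (!name)
    return NULL;
  memcpy (name, p, len);
  name[len] = '\0';
  return name;
}
```

A real implementation would also have to sanitize the result (strip path components, for one) before trusting a server-supplied name as a local file name.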
Re: wget 1.8.1
Jonathan Davis [EMAIL PROTECTED] writes: I recently successfully compiled and installed wget 1.8.1 on my box. The new OS and architecture reads as follows: Mac OS X (powerpc-apple-darwin5.2) Thanks for the report; I've now updated MACHINES.
Re: using wget on local lan failed for only one website...
Boris [EMAIL PROTECTED] writes: As proposed by Hrvoje, I have tried with the retry option, but no change, every time I've got 'read error'. I also tested with the new release for windows (1.8.1), but same thing :( I have no idea what could be going on. Perhaps a Windows person might help? On Unix the error is usually accompanied by an error message slightly more informative than `unknown error'.
Re: WGET - OSX
Dan Lavie [EMAIL PROTECTED] writes: I have just downloaded and installed WGET on my OS-X. You didn't say where you downloaded it from or how you installed it, so I'll assume you're using the standard build process. 1- I can't find any documentation. The documentation is in Info format, installed in `/usr/local/info' by default. 2- How do I remove it from my OS-X system ? You pretty much need to remove the installed files manually. Wget also has an `uninstall' target, but I don't think it has been tested in a while.
Re: doubt
praveen sirivolu [EMAIL PROTECTED] writes: I have a doubt.when we use wget to recursively retrieve pages from internet its not bringing files with shtml and jhtml extensions.is this feature not implemented or if it is there ,could somebody explain me how to get those HTML pages. They should be downloaded. Can you give an example of how you're invoking Wget, preferably accompanied by a debug log?
Re: Wget Patch for 1.8.1 witch IPv6
Daniel Stenberg [EMAIL PROTECTED] writes: `struct addrinfo' contains a `struct sockaddr', which carries the necessary scoping information (I think). The question at the time was whether I could extract only the address(es) and ignore everything else, as it was possible with IPv4. Itojun implied that scoping of addresses made this hard or impossible. Right, you can't just extract a few things from that struct and go with them without very careful considerations. Which is, if I understand correctly, exactly what Thomas does. Thomas, can you follow up on this? I'm worried about the whole scoping business.
Re: -H suggestion
[EMAIL PROTECTED] writes: Funny you mention this. When I first heard about -p (1.7?) I thought exactly that it would default to [spanning hosts to retrieve page requisites]. I think it would be really useful if the page requisites could be wherever they want. I mean, -p is already ignoring -np (since 1.8?), which I think is also very useful. Since 1.8.1. I considered it a bit more dangerous to allow downloading from just any host if the user has not allowed it explicitly. For example, maybe the user doesn't want to load the banner ads? Or maybe he does? Either way, I was presented with a user interface problem. I couldn't quite figure out how to arrange the options to allow for three cases:

* -p gets stuff from this host only, including requisites.
* -p gets stuff from this host only, but requisites may span hosts.
* everything may span hosts.

Fred's suggestion raises the bar, because to implement it we'd need a set of options to juggle the different download depths depending on whether you're referring to the starting host or to the other hosts. The -i switch provides for a file listing the URLs to be downloaded. Please provide for a list file for URLs to be avoided when -H is enabled. URLs to be avoided? Given that a URL can be named in more than one way, this might be hard to do. Sorry, but does --reject-host (or similar, I don't have the docs here ATM) not exactly do this? The existing rejection switches reject on the basis of host name, and on the basis of file name. There is no switch to disallow downloading a specific URL.
Re: IPv6
Thomas Lussnig [EMAIL PROTECTED] writes: how the socket part should work fine. inet_pton and gethostbyname2 only get used if IPV6 is defined Please don't use gethostbyname2. It's apparently a GNU extension, and I don't think it will work anywhere except on Linux. Now it leaves Makefile,evtl I don't know what this is.
Re: WGET+IPv6
Thomas Lussnig [EMAIL PROTECTED] writes:

1. without IPv6 there are no longer used new syscalls (gethostbyname2, inet_ntop, inet_pton)
2. It can at runtime downgrade to IPv4
3. In IPv6 mode it can handle IPv4 addresses
4. Checked with following input: www.ix.de, 217.110.115.160, www.ipv6.euronet.be
   + Address is shown as expected
   + Connection works clean :-)
   + checked also with IPv4 only, where www.ipv6.euronet.be doesn't work (as expected, because it's IPv6 only)

Cool, good work. There are still things to work on, though:

* Autoconf support. Since I don't want to support broken IPv6 implementations, we don't need to get fancy here: checking for several IPv6-specific calls and defining IPv6 only if all of them are there should suffice. There should also be a flag to turn off IPv6 entirely at compile-time.

* FTP. You said you'd look for help there, but I'd at least like to make sure that IPv4 sites work with FTP, even in IPv6 mode. In fact, I dislike the idea of a mode, and I think it should be used only to downgrade to IPv4 and for debug purposes.

* You haven't answered my question about scopes. Is it really safe to just store the IP address and then use it later, without also storing the address scope? Please take a careful look at this.

* Style. Please take a look at Wget's coding standards (described in the PATCHES file), and the accompanying GNU coding standards. Please don't use C++ comments. Please use lower-case variable names (`hostent' or `hptr' instead of HOSTENT).
Re: A strange bit of HTML
Ian Abbott [EMAIL PROTECTED] writes: I came across this extract from a table on a website:

    <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a href="66B27885.htm" "msover1('Pic1','thumbnails/MO66B27885.jpg');" onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img SRC="thumbnails/66B27885.jpg" NAME=Pic1 BORDER=0></a></td>

Note the string beginning "msover1(", which seems to be an attribute value without a name, so that makes it illegal HTML. I think it's even worse than that. My limited knowledge of SGML taught me that `<foo bar>' is equivalent to `<foo bar=bar>', which means that given `<foo bar>', bar is the attribute *name*, not value. If I understand SGML correctly, attribute names cannot be quoted. This makes `<foo "bar">' illegal even though `<foo bar=10>' or `<foo bar>' are perfectly valid. I haven't traced what Wget is actually doing when it encounters this, but it doesn't treat 66B27885.htm as a URL to be downloaded. I can't call this a bug, but is Wget doing the right thing by ignoring the href altogether? According to Wget's notion of HTML, the A tag in question is simply not a well-formed tag. This means that Wget's parser will back out to the character `a' (the second char of `<a href=...') and continue parsing from there. Generally, when faced with a syntax error, it is extremely hard to just ignore it and extract a useful result from garbage. In some cases it's possible; in most, it's just too much work. Loosely, html-parse.c will recognize the following things as tags. (S stands for a strictly matched string -- only letters, numbers, hyphen and underscore allowed; L stands for a loosely matched string, i.e. everything except whitespace and separators, such as quote, `>', etc.)

    <S S1=L1 S2=L2 ...>      -- normal tag with attributes
    <S S1="L1" S2="L2" ...>  -- like the above, but quotation allows more leeway on values
    <S S1>                   -- the same as <S S1=S1>

Given the amount of broken HTML on the web, it's easy to imagine this parser being confused about what's what. That is why the attribute names are matched strictly.
Now, it would be fairly easy to change the parser to match the attribute names loosely like it does for values, but to parse the above piece of broken HTML, it would have to be extended to handle:

    <S L1>       (and, I assume)       <S L1=L2>

I wonder if that's worth it. On the one hand, it might be helpful to someone (e.g. you). On the other hand, there will always be one more piece of illegal HTML that Wget *could* handle if tweaked hard enough.
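The strict (S) versus loose (L) character classes described above can be expressed as two tiny predicates. These are illustrative helpers only, not the ones in html-parse.c:

```c
#include <ctype.h>

/* Strict (S): only letters, digits, hyphen and underscore -- the
   class Wget accepts for tag and attribute names. */
static int
name_char_p (int c)
{
  return isalnum (c) || c == '-' || c == '_';
}

/* Loose (L): anything except whitespace and separators such as
   quotes and '>' -- the class accepted for unquoted values. */
static int
value_char_p (int c)
{
  return !isspace (c) && c != '"' && c != '\'' && c != '>';
}
```

Matching names with name_char_p() is what makes `"msover1(...)"' fail as an attribute name: the quote character is neither a letter, a digit, a hyphen, nor an underscore.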
Re: A strange bit of HTML
[EMAIL PROTECTED] writes: That sounds like they wanted onMouseOver=msover1(...) Which Wget would, by the way, have handled perfectly.
Re: How does -P work?
Ian Abbott [EMAIL PROTECTED] writes: Here is a patch to deal with the -P C:\temp (and similar) problems on Windows. This looks good. I'll apply it as soon as CVS becomes operational again.
Re: WGET+IPv6
Daniel Stenberg [EMAIL PROTECTED] writes: On Wed, 16 Jan 2002, Hrvoje Niksic wrote: The so-called scope in IPv6 is embedded in the address, so you can't use IPv6 addresses without getting the scope too. Are you sure? Here is what itojun said in [EMAIL PROTECTED]: due to the IPv6 address architecture (scoped), 16 bytes does not identify a node. we need 4 byte more information to identify the outgoing scope zone. Am I misunderstanding him that Wget needs to keep 20 bytes of information to successfully connect? Well, if itojun said it then that's so. I have the greatest respect for his IPv6 abilities. That may be so, but I'm not entirely convinced that it is really important that Wget care about scopes in this context. I'm still discussing it with Thomas. For example, the gethostbyname2 and getipnodebyname calls don't return the scope at all. Does it mean that all applications that use them are broken? Somehow I doubt it. (And yes, I know that both seem to be obsolete, but I still find it strange that such a feature of such importance would be missing.) If someone here understands IPv6 enough to explain this, I'd be grateful to hear the clarification.
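For what it's worth, the "16 + 4 bytes" itojun describes maps directly onto struct sockaddr_in6 from the BSD sockets API: the 16-byte address proper plus a 32-bit scope zone identifier (sin6_scope_id). The helpers below just show where those bytes live; whether Wget must carry the scope around is the open question in this thread:

```c
#include <stddef.h>
#include <netinet/in.h>

/* The IPv6 address itself: 16 bytes. */
static size_t
ipv6_addr_bytes (void)
{
  return sizeof (struct in6_addr);
}

/* The scope zone id that accompanies it in sockaddr_in6: 4 bytes,
   giving itojun's 20 bytes total. */
static size_t
ipv6_scope_bytes (void)
{
  struct sockaddr_in6 sa;
  return sizeof sa.sin6_scope_id;
}
```

This is also why "just keep the addrinfo" works transparently: the sockaddr_in6 inside it already carries the scope, while extracting only the 16 address bytes discards it.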
Re: A strange bit of HTML
[EMAIL PROTECTED] writes: Until there's an ESP package that can guess what the author intended, I doubt wget has any choice but to ignore the defective tag. Seriously, I think you guys are too strict. Similar discussions have spawned numerous times. If the HTML code says

    <a href="URL" yaddayada my-Mother=Shopping%5 goingsupermarket>...</a>

why can't wget just ignore everything after ...URL? Because, as he said, Wget can parse text, not read minds. For example, you must know where a tag ends to be able to look for the next one, or to find comments. It is not enough to look for `>' to determine the tag's ending -- something like `<img alt="my > dog" src="foo">' is a perfectly legal tag. In other words, you have to destructure the tag, not only to retrieve the URLs, but to be able to continue parsing. If the tag is not syntactically valid, the parsing fails, on to other tags. Wget has never been able to pick apart every piece of broken HTML. As for us being strict, I can only respond with a mini-rant... Wget doesn't create web standards, but it tries to support them. Spanning the chasm between the standards as written and the actual crap generated by HTML generators feels a lot like shoveling shit. Some amount of shoveling is necessary and is performed by all small programs to protect their users, but there has to be a point where you draw the line. There is only so much shit Wget can shovel. I'm not saying Ian's example is where the line has to be drawn. (Your example is equivalent to Ian's -- Wget would only choke on the last (`goingsupermarket') part.) But I'm sure that the line exists and that it is not far from those two examples.
Re: wget does not parse .netrc properly
Alexey Aphanasyev [EMAIL PROTECTED] writes: I'm using wget compiled from the latest CVS sources (GNU Wget 1.8.1+cvs). I use it to mirror several ftp sites. I keep ftp accounts in .netrc file which looks like this: [...] Ah, I see. The macro definition (`macdef init') would fail to be terminated at empty lines. Thanks for the report. This patch should fix the problems; please let me know if it works for you.

2002-01-17  Hrvoje Niksic  [EMAIL PROTECTED]

	* netrc.c (parse_netrc): Skip leading whitespace before testing
	whether the line is empty.  Empty lines still contain the line
	terminator.

Index: src/netrc.c
===================================================================
RCS file: /pack/anoncvs/wget/src/netrc.c,v
retrieving revision 1.10
diff -u -r1.10 netrc.c
--- src/netrc.c	2001/11/30 09:33:22	1.10
+++ src/netrc.c	2002/01/17 00:56:12
@@ -280,6 +280,10 @@
       p = line;
       quote = 0;
 
+      /* Skip leading whitespace. */
+      while (*p && ISSPACE (*p))
+        ++p;
+
       /* If the line is empty, then end any macro definition. */
       if (last_token == tok_macdef && !*p)
 	/* End of macro if the line is empty. */
Re: How does -P work?
Hrvoje Niksic [EMAIL PROTECTED] writes: Ian Abbott [EMAIL PROTECTED] writes: Here is a patch to deal with the -P C:\temp (and similar) problems on Windows. This looks good. I'll apply it as soon as CVS becomes operational again. Applied now.
Re: wget-1.8
Tay Ngak San [EMAIL PROTECTED] writes: I have downloaded your source code for wget and tried to make it but failed due to va_list parameter conflict in stdarg.h and stdio.h. Please advice. What OS and compiler are you using to compile Wget?
Re: wget does not parse .netrc properly
Alexey Aphanasyev [EMAIL PROTECTED] writes: It works for me. I wish the patch included in the next release. Thanks for the confirmation. The patch is already in CVS.
Re: Bug report: 1) Small error 2) Improvement to Manual
Herold Heiko [EMAIL PROTECTED] writes: My personal idea is: As a matter of fact no *windows* text editor I know of, even the supplied windows ones (notepad, wordpad), AFAIK will add the ^Z at the end of file.txt. Wget is a *windows* program (although running in console mode), not a *Dos* program (except for the real dos port I know exists but never tried out). So personally I'd say it would not be really necessary adding support for the ^Z, even in the win32 port; That was my line of thinking too.
Re: Mapping URLs to filenames
Ian Abbott [EMAIL PROTECTED] writes: Most (all?) of the escape sequences within URLs should be decoded before transforming to an external file-name. All, I'd say. Even now u->file and u->dir are not URL-encoded. They get reencoded later, by url_filename. The point between the two is my internal or ideal file-name, but the two steps can be combined. I see what you mean. I guess u->file and u->dir constitute internal file names, and whatever url_filename() returns is external? I guess we'll have to check back in the mail archives to find out the original argument for the %-@ patch. The ChangeLog entry is dated 1997-01-18. I'll check my personal archives. At that time the patches list was not established, and many people were sending patches to me directly.
Re: Passwords and cookies
Ian Abbott [EMAIL PROTECTED] writes:

-	   asctime (localtime ((time_t *)&cookie->expiry_time)),
+	   (cookie->expiry_time != ~0UL
+	    ? asctime (localtime ((time_t *)&cookie->expiry_time))
+	    : "<UNKNOWN>"),
 	   cookie->attr, cookie->value));
 }

Yes, except for any other values of cookie->expiry_time that would cause localtime() to return a NULL pointer Good point. (in the case of Windows, anything before 1970). Perhaps the return value of localtime() should be checked before passing it to asctime() as in the modified version of your patch I have attached below. Yes, that's the way to go. Except I'll probably add a bit more complexity (sigh) to print something other than "<UNKNOWN>" when the expiry time is ~0UL. I'm also a little worried about the (time_t *)&cookie->expiry_time cast, as cookie->expiry_time is of type unsigned long. Is a time_t guaranteed to be the same size as an unsigned long? It's not, but I have a hard time imagining an architecture where time_t will be *larger* than unsigned long.
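The defensive pattern agreed on above -- never hand a NULL from localtime() to asctime() -- boils down to a guard like this. The helper name is hypothetical, not Wget's actual code:

```c
#include <time.h>

/* Sketch of the check discussed above: format a time value, falling
   back to a placeholder when localtime() cannot represent it (e.g.
   pre-1970 values on Windows).  `format_time' is a hypothetical
   helper.  Note asctime() returns a pointer to a static buffer. */
static const char *
format_time (time_t t)
{
  struct tm *tm = localtime (&t);
  if (!tm)
    return "<UNKNOWN>";   /* localtime() couldn't represent t */
  return asctime (tm);
}
```

Passing the NULL straight through would make asctime() dereference it, which is the crash the patch review is guarding against.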
Re: [PATCH] Failed linking on SCO Openserver
Jeff Bailey [EMAIL PROTECTED] writes: wget 1.8 fails to link on i686-pc-sco3.2v5.0.6 Does the compiler on that machine really not have alloca()? I'm usually wary of attempts to compile `alloca.c' because they usually point out a mistake in the configuration process.
Re: WGet 1.8.1
Lauri Mägi [EMAIL PROTECTED] writes: I'm using WGet 1.8.1 for downloading files over the FTP protocol. When a filename contains spaces the url is like ftp://server.name/file%20name and it saves files also with %20 in file names. Prior to this I was using WGet 1.7 and it saved spaces as they should be. My OS is RedHat 7.2. I tried the w32 version also, but it puts %20 into filenames. Is it a bug or just a feature? It's supposed to be a feature, but many users dislike that particular feature. Which means it is likely to go away in the next release. (Some dangerous characters will still be encoded to %hh, but space is likely not to be one of them.)
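Turning `%20' back into a space is plain %hh-escape decoding. A minimal in-place sketch (illustrative only -- Wget's own decoder additionally decides which characters are too dangerous to leave unencoded, as noted above):

```c
/* Return the value of a hex digit, or -1 if c is not one. */
static int
hexval (char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Decode %hh escapes in place: "file%20name" becomes "file name".
   Malformed escapes (e.g. "%zz") are left untouched. */
static void
url_unescape (char *s)
{
  char *w = s;
  for (; *s; s++)
    {
      int hi, lo;
      if (*s == '%'
          && (hi = hexval (s[1])) >= 0
          && (lo = hexval (s[2])) >= 0)
        {
          *w++ = (char) (hi * 16 + lo);
          s += 2;               /* skip the two hex digits */
        }
      else
        *w++ = *s;
    }
  *w = '\0';
}
```

Decoding always shortens or preserves the string, which is why the in-place write pointer `w' can never overrun the read pointer `s'.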
Re: WGet is a very useful program. Writing a program to make the documentation easy
Michael Jennings [EMAIL PROTECTED] writes: The issue centers on the documentation. Philosophically, in my opinion, a program should be written so the documentation is easy to read. In this case a hidden stripping of useless characters means that there is one less thing to explain in the manual. No, it's one *more* thing to explain in the manual. The only characters universally agreed to be useless in the context of parsing are the whitespace characters. *Everything* else is subject to serious considerations. For example, control characters for you might be UTF8-encoded characters for someone else. Not stripping them away without a very good reason to do so is for me a simple matter of correctness. The GNU coding standards seem to suggest the same. (...) Or go for generality. For example, Unix programs often have static tables or fixed-size strings, which make for arbitrary limits; use dynamic allocation instead. Make sure your program handles NULs and other funny characters in the input files. Add a programming language for extensibility and write part of the program in that language. and: Utilities reading files should not drop NUL characters, or any other nonprinting characters _including those with codes above 0177_. The only sensible exceptions would be utilities specifically intended for interface to certain types of terminals or printers that can't handle those characters. Whenever possible, try to make programs work properly with sequences of bytes that represent multibyte characters, using encodings such as UTF-8 and others. There is precedent for this. Microsoft Windows is in some places written to get around shortcomings in the processors on which it runs. Such accommodation puts quirkiness in the code, but it gets the job done. In many cases Wget tries to accommodate to its environment to ensure smoother operation. But with each such accommodation we are forced to weigh the added quirkiness (entropy) of the code against the benefit.
In this case, implementing correct support for ^Z is not exactly trivial, and the benefit is minimal -- the ^Z characters don't even appear in files normally created on platforms supported by Wget, which are Unix and Windows. You are trying to convince us otherwise by offering an easier implementation of ^Z, thereby reducing the costs. But unfortunately this easier implementation reduces correctness of the code, and is therefore not an option. Sorry.
Re: HELP: getaddrinfo
Thomas Lussnig [EMAIL PROTECTED] writes: i'm building an IPv6 patch for wget. And i'm worried about the point that i have to add 12 in the sockaddr. Perhaps it would help if you created a minimal test case for the problem you're witnessing. For example:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>

int
main (int argc, char **argv)
{
  char *host = argv[1];
  struct addrinfo *ai_head, *ai;
  int err = getaddrinfo (host, NULL, NULL, &ai_head);
  if (err != 0)
    return err;
  for (ai = ai_head; ai; ai = ai->ai_next)
    {
      char buf[128];
      struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *) ai->ai_addr;
      if (!inet_ntop (AF_INET6, &sa6->sin6_addr, buf, sizeof (buf)))
	{
	  perror ("inet_ntop");
	  continue;
	}
      puts (buf);
    }
  return 0;
}

Compile this. For me it seems to work correctly:

$ ./a.out www.ipv6.euronet.be
3ffe:8100:200:2::2
3ffe:8100:200:2::2
3ffe:8100:200:2::2
::57.0.0.0
::57.0.0.0
::9.11.0.0

The first three IP addresses seem to be correct (I remember them from your logs). So adding 12 does not appear to be necessary on my system. Does the same test work for you?
Re: -A -R Problems With List
Jan Hnila [EMAIL PROTECTED] writes: Hello, please try this (it should work): wget -r -l2 -A=htm,html,phtml http://www.tunedport.com (the change is the equals sign. The same for -R. If you take a look at the output of wget --help, you may notice the equality signs there (in the longer form: --accept=LIST ), but it really is easy to find it out.) This advice is simply wrong; have you actually tried it? The option syntax is one of:

    wget -x value
    wget -xvalue
    wget --xxx value
    wget --xxx=value

where x and xxx are short and long option names. I haven't yet had time to investigate Samuel's problem, but it's almost certainly not alleviated by prepending = to a one-letter option.
Re: Possible bugs when making https requests
Sacha Mallais [EMAIL PROTECTED] writes: Unable to establish SSL connection. -- Also note that it does _not_ appear to be retrying the connection. I have explicitly set --tries=5, and with a non-ssl connection, the above stuff appears 5 times when it cannot connect. But, for SSL stuff, one failure kills the process. I cannot say why the connection fails, but I can explain why it's not retried. It's because Wget (perhaps improperly) considers the failure non-transient. Such permanent failures cause Wget to give up on the URL. Other permanent failures include failure to perform a DNS lookup, connection refused on all known interfaces, etc. For example:

$ wget --tries=100 http://www.xemacs.org:1212
--00:11:08--  http://www.xemacs.org:1212/
           => `index.html'
Resolving www.xemacs.org... done.
Connecting to www.xemacs.org[207.96.122.9]:1212... failed: Connection refused.
$
Re: problems with char sequence %26
wget Admin [EMAIL PROTECTED] writes: I am using wget version 1.5.3 under Solaris and 1.5.2 under IRIX. Please upgrade. This problem is fixed in Wget 1.8.1. Do you have any ideas to solve the problem? (Possibly without having to recompile wget since I am not sysadmin.) You do not have to be a sysadmin to recompile Wget. Just install it into your home directory.
Re: Wget 1.6
Way, Trevor [EMAIL PROTECTED] writes: Using the -T, -t and -w parameters but cannot get it to timeout in less than 3 minutes. /usr/bin/wget --output-document=/tmp/performance.html -T5 --wait=2 --waitretry=2 --tries=2 Should this timeout after 5 secs, retry twice, waiting 2 secs between retries? BUT it always waits 3 minutes. Note that -T only sets the read timeout, not the connect timeout.
Re: PROXY + wget ftp://my.com/pub/my*.tar
Thanos Siaperas [EMAIL PROTECTED] writes: Shouldn't wget first get the .listing, find the files needed by the wildcard, and then request the files from the proxy? This looks like a bug. No, when using a proxy, you get HTTP behavior. So to do that, you have to do it the HTTP way: wget -rl1 ftp://my.com/pub/ -A 'my*.tar'
Re: stdout
Jens Röder [EMAIL PROTECTED] writes: for wget I would suggest a switch that allows sending the output directly to stdout. It would be easier to use it in pipes. Are you talking about the log output or the text of the documents Wget downloads? * Log output goes to stderr by default, and can be redirected by one of: wget ... 2>&1 wget -o /dev/stdout ... # for systems with /dev/stdout wget -o /dev/fd/1 ... # for systems with /dev/fd/1 * The contents of a downloaded document can be redirected to stdout using `-O -'. Since the log output is printed to stderr, it won't be mixed up with the download output. For instance, this works as you'd expect: wget -O - http://hrvoje.willfork.com/tst/Recurdos_De_La_Alhambra.mp3 | mpg123 -
Re: Noise ratio getting a bit high?
Andre Majorel [EMAIL PROTECTED] writes: I respectfully disagree. If we can spend the time to read and answer the poster's question, the poster can spend five minutes to subscribe/unsubscribe. For reference, see the netiquette item on posting to newsgroups and asking for replies by email. I am aware of newsgroup etiquette, but I consider a newsgroup to be different from a mailing list devoted to helping users. Besides, subscribing to and unsubscribing from an unknown mailing list are much more annoying processes than they are for newsgroups. I suppose we can only agree to disagree on this one. I am aware that in this matter, as well as in the infamous `Reply-To' debate, this list lies in the minority. But that is not a sufficient reason to back down and let the spammers win. Right now, [EMAIL PROTECTED] is providing free relaying for spammers to all its subscribers. So does any mailing list with open subscription. I find your choice of wording strange, sort of like saying that `sendmail' provides free transmission of spam. That may be so, but that was not its intention, and the fact that it's misused is no reason to cripple its intended use. If you have a spam-fighting suggestion that does *not* include disallowing non-subscriber postings, I am more than willing to listen. Mmm... What would you think of having the list software automatically add a special header (say X-Non-Subscriber) to every mail sent by a non-subscriber? I see what you're getting at, and I would have absolutely no objections to that.
Re: windows binary
Brent Morgan [EMAIL PROTECTED] writes: What's CVS and what is the significance of this version? CVS stands for Concurrent Versions System, and is the version control system where the master sources for Wget are kept. I would not advise the download of the CVS version because it is likely to be incomplete or unstable. It would be nice if the 1.8.1+cvs binary could be moved to a less visible location, or on a separate page dedicated to development. Or accompanied by an explanation, etc.
Re: bug when processing META tag.
An, Young Hun [EMAIL PROTECTED] writes: if HTML document contains code like this <meta http-equiv="Refresh"> wget may be crashed. It has 'refresh' but does not have 'content'. Of course this is incorrect HTML. But I found some pages on the web :) simply add check routine at 'tag_handle_meta' function. Thanks for the report; this patch should fix the bug: 2002-02-01 Hrvoje Niksic [EMAIL PROTECTED] * html-url.c (tag_handle_meta): Don't crash on <meta http-equiv="refresh"> where content is missing. Index: src/html-url.c === RCS file: /pack/anoncvs/wget/src/html-url.c,v retrieving revision 1.23 diff -u -r1.23 html-url.c --- src/html-url.c 2001/12/19 01:15:34 1.23 +++ src/html-url.c 2002/02/01 03:32:55 @@ -521,10 +521,13 @@ get to the URL. */ struct urlpos *entry; - int attrind; - char *p, *refresh = find_attr (tag, "content", &attrind); int timeout = 0; + char *p; + + char *refresh = find_attr (tag, "content", &attrind); + if (!refresh) + return; for (p = refresh; ISDIGIT (*p); p++) timeout = 10 * timeout + *p - '0';
Re: wget: malloc: Not enough memory.
Michael Dodwell [EMAIL PROTECTED] writes: Just noticed that wget 1.7 errors with the subject line if you pass it a protocol, port and username but not a password. Please upgrade to Wget 1.8.1. I believe this problem has gone away.
Re: KB or kB
Ian Abbott [EMAIL PROTECTED] writes: I'd suggest either leaving them alone or adopting the IEC standards that Henrik referred to, i.e. KiB = kibibyte = 2^10 bytes Ugh! Never! Let them keep their kibibytes to themselves. :-)
Re: wildcard(?) use inf wget 1.8
cagri coltekin [EMAIL PROTECTED] writes: Apologies if this is a known issue. However, it seems that as of wget 1.8, the `?' char is treated as a separator in URLs. But this feature breaks the ftp downloads using wild-card `?'. It would be nice to disable this in url_parse() if the url is not an http* url. Thanks for the report. The problem you described should be fixed by this patch: 2002-02-19 Hrvoje Niksic [EMAIL PROTECTED] * url.c (url_parse): Don't treat '?' as query string separator when parsing FTP URLs. Index: src/url.c === RCS file: /pack/anoncvs/wget/src/url.c,v retrieving revision 1.71 diff -u -r1.71 url.c --- src/url.c 2002/01/26 20:43:17 1.71 +++ src/url.c 2002/02/19 05:07:24 @@ -802,6 +802,15 @@ query_b = p; p = strpbrk_or_eos (p, "#"); query_e = p; + + /* Hack that allows users to use '?' (a wildcard character) in +FTP URLs without it being interpreted as a query string +delimiter. */ + if (scheme == SCHEME_FTP) + { + query_b = query_e = NULL; + path_e = p; + } } if (*p == '#') {
Re: bug
Peteris Krumins [EMAIL PROTECTED] writes: GNU Wget 1.8 get: progress.c:673: create_image: Assertion `p - bp->buffer <= bp->width' failed. This problem has been fixed in Wget 1.8.1. Please upgrade.
Re: FTP passwords?
John A Ogren [EMAIL PROTECTED] writes: I'd like to use 'wget' to mirror a remote ftp directory, but it requires a username and password to access the server. I don't see any mention of command-line options for supplying this information for an FTP server, only for an HTTP server. Is this a bug, or a feature, or am I just missing something obvious? There is a `.wgetrc' command for setting the password, which you can use on the command line with `-e' (`--execute'). For example: wget -e 'login=foo' -e 'passwd=bar' ftp://server/dir/file The same can be achieved with: wget ftp://foo:bar@server/dir/file Or you can store the username/password in `.netrc'.
Re: -N option gives proxy error
It's a known problem. Timestamping doesn't work with FTP URLs over proxy because the HEAD request is not honored by the proxy for FTP. Note that your Wget is very old and you should upgrade -- though not because of this particular problem, which remains in current versions.
Re: ftp download with -p
Currently this is a known problem. Wget doesn't span hosts or schemes with -p, although it probably should.
Re: Using wildcards through a proxy server
It's a known issue. Wget's wildcard magic only works when using the FTP protocol. HTTP is used for communication with proxies, so wildcarding doesn't work. But you should be able to simulate it using: wget -nd -rl1 -A 'foo*bar' ftp://server/dir/ It's not elegant, but it works for me.
Re: End of IPv6 Scope_Id discussion
Again, thanks for taking the time to research this. Next time someone asks this question, we'll forward him this email.
Re: wget core dump with recursive file transfer
Thanks for the report, Paul. This patch, which I'm about to apply to CVS, should fix it. 2002-02-19 Hrvoje Niksic [EMAIL PROTECTED] * recur.c (retrieve_tree): Handle the case when start_url doesn't parse. Index: src/recur.c === RCS file: /pack/anoncvs/wget/src/recur.c,v retrieving revision 1.42 diff -u -r1.42 recur.c --- src/recur.c 2002/02/19 05:23:35 1.42 +++ src/recur.c 2002/02/19 06:07:39 @@ -186,15 +186,24 @@ uerr_t status = RETROK; /* The queue of URLs we need to load. */ - struct url_queue *queue = url_queue_new (); + struct url_queue *queue; /* The URLs we do not wish to enqueue, because they are already in the queue, but haven't been downloaded yet. */ - struct hash_table *blacklist = make_string_hash_table (0); + struct hash_table *blacklist; - /* We'll need various components of this, so better get it over with - now. */ - struct url *start_url_parsed = url_parse (start_url, NULL); + int up_error_code; + struct url *start_url_parsed = url_parse (start_url, &up_error_code); + + if (!start_url_parsed) +{ + logprintf (LOG_NOTQUIET, "%s: %s.\n", start_url, +url_error (up_error_code)); + return URLERROR; +} + + queue = url_queue_new (); + blacklist = make_string_hash_table (0); /* Enqueue the starting URL. Use start_url_parsed->url rather than just URL so we enqueue the canonical form of the URL. */
Re: wget core dump with recursive file transfer
Thanks for looking into this. I've written a slightly different fix before I saw the one from you. Your patch was *almost* correct -- one minor detail is that you don't take care to free QUEUE and BLACKLIST before exiting, therefore technically creating a (small) memory leak. My patch avoids the leak simply by making sure the return is placed before calls to url_queue_new() and make_string_hash_table().
Re: using -nd ?
Samuel Hargis [EMAIL PROTECTED] writes: I've read through the documentation and it says that (if a name shows up more than once, the filenames will get extensions '.n') Would that be like index.html duplicate would be named index.n.html or index.html.n? The latter. Also, how does it handle multiple duplicates, like say 5? It will create index.html, then index.html.1, then index.html.2, etc. wget -P ~MyUserDirectory/WincraftFolder -nd -r -l2 -p -np -t3 -T 30 -nv -A.asp,.cfm,.phtml,.shtml,.htm,.html www.wincraftusa.com There are 6 html files in this domain that kill each other out. I'm trying to get that data, all files with same names without them canceling each other out. I don't care if the names are modified when downloaded as long as I get all 6 files. Can someone please assist? Ouch. I think I understand what the problem is here. Wget deletes the index.html.n files because it thinks they're rogue HTML downloads. You can work around this bug by including '*.[0-9]' in your accept list. For instance: wget -P ~MyUserDirectory/WincraftFolder -nd -r -l2 -p -np -t3 -T 30 -nv -A.asp,.cfm,.phtml,.shtml,.htm,.html,'*[0-9]' www.wincraftusa.com Thanks for the report.
Re: wget timeout option useless
Jamie Zawinski [EMAIL PROTECTED] writes: Please also set an exit alarm around your calls to connect() based on the -T option. This is requested frequently. I'll include it in the next release. The reason why it's not already there is simply that I was lucky never to be bitten by that problem. For some reason, the systems I've worked on have always either connected or timed out in reasonable time.
Re: wget info page
Noel Koethe [EMAIL PROTECTED] writes: wget 1.8.1 is shipped with the files in doc/ wget.info wget.info-1 wget.info-2 wget.info-3 wget.info-4 Yes. As Ian said, this is so that people without `makeinfo' installed can still read the documentation. (In fact, Info pages can even be read without an Info reader.) I believe this is mandated by the GNU standards. `make distclean' doesn't remove those Info pages precisely so that they can be shipped with the release.
Re: wget info page
Noel Koethe [EMAIL PROTECTED] writes: OK. No problem for me. I just wrote this because the more interesting doc, the manpage, is not shipped with the source. I don't know how the man page is more interesting since it's a mere subset of the Info documentation. All the GNU programs are shipped with preformatted Info, and Wget is no exception there. The current man page is a compromise to appease the people who insist on having a Unix-style man page as well.
Re: RFC1806 Content-disposition patch (Take 2)
[ Adding the development list to Cc, to facilitate discussion. ] David F. Newman [EMAIL PROTECTED] writes: First of all, I think this new behaviour needs an option to enable it, rather than be on by default. The option could be called rfc1806, or rather, rfc2183 now, unless anyone can suggest a friendlier name such as --obey-content-disposition-filename! Even better, I would entirely avoid the rfc numbers in naming either command-line options *or* functions. Mentioning rfc-s in comments or in the manual is fine, of course. OK, I'll take these things into account. My concern with the valid characters was that someone doesn't specify an absolute path and pass something like /etc/passwd. So you think that stripping out the leading path is enough? That shouldn't be too tough. I'm not sure what you mean by stripping out the leading part, but what I suggest is to leave only the trailing part. So if someone specifies /etc/passwd, it's exactly the same as if he specified just passwd. And how about simply --honor-content-disposition Ian, why do you think this should not be allowed by default? A command-line option is easy to miss, and honoring this looks like a neat idea. Am I missing something?
Re: Problem with the way that wget handles %26 == '&' in URLs
Robert Lupton the Good [EMAIL PROTECTED] writes: This appears to be an over-enthusiastic interpretation of %26 == '&' in wget. I submit a URL (which is in fact a SQL query) with some embedded &s (logical ORs). These are encoded as %26, and the URL works just fine with netscape and lynx. It fails with wget. Note that wget rewrites Where+((A.status+%26+0x4)+=+0&format=csv as Where ((A.status & 0x4) = 0 which is a problem. wi:wget-1.8.1> src/wget --version GNU Wget 1.8.1 Odd. Earlier versions of Wget did this, but 1.8.1 shouldn't. For me that doesn't seem to happen: $ wget 'http://fly.srk.fer.hr/Where+((A.status+%26+0x4)+=+0&format=csv' -d DEBUG output created by Wget 1.8.1 on linux-gnu. --04:10:12-- http://fly.srk.fer.hr/Where+((A.status+%26+0x4)+=+0&format=csv => `Where+((A.status+&+0x4)+=+0&format=csv' Resolving fly.srk.fer.hr... done. Caching fly.srk.fer.hr => 161.53.70.130 Connecting to fly.srk.fer.hr[161.53.70.130]:80... connected. Created socket 3. Releasing 0x807b9e8 (new refcount 1). ---request begin--- GET /Where+((A.status+%26+0x4)+=+0&format=csv HTTP/1.0 ... Looks ok to me. 
In your example, I also don't quite see the problem; the URL specified on the command line is identical to the one rewritten by Wget: wi:wget-1.8.1> src/wget -O - 'http://skyserver.sdss.org/en/tools/search/x_sql.asp?cmd=+select+top+10+A.run,+A.camCol,+A.field,+str(A.rowc,7,2)+as+rowc,+str(A.colc,7,2)+as+colc,+str(dbo.fObjFromObjID(A.ObjId),+4)+as+id,+B.run,+B.camCol,+B.field,+str(B.rowc,7,2)+as+rowc,+str(B.colc,7,2)+as+colc,+str(dbo.fObjFromObjID(B.ObjId),+4)+as+id,+str(A.u,+5,3)+as+Au,+str(A.g,+5,3)+as+Ag,+str(A.r,+5,3)+as+Ar,+str(A.i,+5,3)+as+Ai,+str(A.u+-+B.u,+5,3)+as+du,+str(A.g+-+B.g,+5,3)+as+dg,+str(A.r+-+B.r,+5,3)+as+dr,+str(A.i+-+B.i,+5,3)+as+di+from+photoObj+as+A,+photoObj+as+B,+Neighbors+as+ObjN+Where+((A.status+%26+0x4)+=+0&format=csv' --16:24:15-- http://skyserver.sdss.org/en/tools/search/x_sql.asp?cmd=+select+top+10+A.run,+A.camCol,+A.field,+str(A.rowc,7,2)+as+rowc,+str(A.colc,7,2)+as+colc,+str(dbo.fObjFromObjID(A.ObjId),+4)+as+id,+B.run,+B.camCol,+B.field,+str(B.rowc,7,2)+as+rowc,+str(B.colc,7,2)+as+colc,+str(dbo.fObjFromObjID(B.ObjId),+4)+as+id,+str(A.u,+5,3)+as+Au,+str(A.g,+5,3)+as+Ag,+str(A.r,+5,3)+as+Ar,+str(A.i,+5,3)+as+Ai,+str(A.u+-+B.u,+5,3)+as+du,+str(A.g+-+B.g,+5,3)+as+dg,+str(A.r+-+B.r,+5,3)+as+dr,+str(A.i+-+B.i,+5,3)+as+di+from+photoObj+as+A,+photoObj+as+B,+Neighbors+as+ObjN+Where+((A.status+%26+0x4)+=+0&format=csv What am I missing?
Re: OK, time to moderate this list
Doug Kearns [EMAIL PROTECTED] writes: On Fri, Mar 22, 2002 at 04:08:36AM +0100, Hrvoje Niksic wrote: snip I think I agree with this. The amount of spam is staggering. I have no explanation as to why this happens on this list, and not on other lists which are *also* open to non-subscribers. I guess you are not subscribed to [EMAIL PROTECTED]? It is not just this list :-( Good to hear, for a certain deranged value of good. :-( However, I'm also subscribed to [EMAIL PROTECTED] and to [EMAIL PROTECTED] Both of them allow non-subscriber posts, and there is very little spam on either. Go figure.
Re: Can wget handle this scenario?
Tomislav Goles [EMAIL PROTECTED] writes: Now I need to add the twist where username account info resides on another machine (i.e. machine2 which by the way is on the same network as machine1) So I need to do something like the following: $ wget ftp://username:[EMAIL PROTECTED]@machine1.com/file.txt which is of course not the syntax wget understands. Use: $ wget ftp://username:[EMAIL PROTECTED]/file.txt However, this will not work on Wget 1.8.1 due to a bug in handling URL-encoded passwords. You can remedy that with this patch: --- wget-1.8.1.orig/src/url.c +++ wget-1.8.1/src/url.c @@ -528,6 +528,11 @@ memcpy (*user, str, len); (*user)[len] = '\0'; + if (*user) +decode_string (*user); + if (*passwd) +decode_string (*passwd); + return 1; } I plan to implement this behavior automagically when you set the FTP proxy to ftp://machine1.com/.
Re: wget reject lists
David McCabe [EMAIL PROTECTED] writes: I am having a hell of a time to get the reg-ex stuff to work with the -A or -R options. If I supply this option to my wget command: -R 1* Everything works as expected. Same with this: -R 2* Now, if I do this: -R 1*,2* I get all the files beginning with 1. if I do this: -R 2*,1* I get all the files beginning with 2. I've now tried to repeat this, but I am unable to. This will sound incredible, but based on some reports I got, I believe what you see is a result of a Gcc bug. Specifically, gcc 2.95.something can miscompile sepstring() in utils.c. Please try recompiling Wget with `cc' or without optimization and see if it works then.
Re: Debian bug 32353 - opens a new connection for each ftpdocument.
Guillaume Morin [EMAIL PROTECTED] writes: if I use 'wget ftp://site.com/file1.txt ftp://site.com/file2.txt', wget will not reuse the ftp connection, but will open one for each document downloaded from the same site... Yes, that's how Wget currently behaves. But that's not a bug, or at least not an obvious one -- the files do get downloaded. Handling this correctly is, I believe, on the TODO list, and should be classified as a wishlist item.
Re: debian bug 32712 - wget -m sets atimet to remote mtime.
Good point there. I wonder... is there a legitimate reason to require atime to be set to the mtime? If not, we could just make the change without the new option. In general I'm careful not to add new options unless they're really necessary.
Re: Debian bug 55145 - wget gets confused by redirects
Guillaume Morin [EMAIL PROTECTED] writes: If wget fetches a url which redirects to another host, wget retrieves the file, and there's nothing that can be done to turn that off. So, if you do wget -r on a machine that happens to have a redirect to www.yahoo.com you'll wind up trying to pull down a big chunk of yahoo. Hmm. Are you sure? Wget 1.8.1 is trying hard to restrict following redirections by applying the same rules normally used for following links. Downloading half of Yahoo! because someone redirects to www.yahoo.com is not intended to happen. I tried to reproduce it by creating a page that redirects to www.yahoo.com, but Wget behaved correctly: $ wget -r -l0 http://muc.arsdigita.com:2005/test.tcl --19:13:53-- http://muc.arsdigita.com:2005/test.tcl => `muc.arsdigita.com:2005/test.tcl' Resolving muc.arsdigita.com... done. Connecting to muc.arsdigita.com[212.84.246.68]:2005... connected. HTTP request sent, awaiting response... 302 Found Location: http://www.yahoo.com [following] --19:13:53-- http://www.yahoo.com/ => `www.yahoo.com/index.html' Resolving www.yahoo.com... done. Connecting to www.yahoo.com[64.58.76.223]:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] [ <=> ] 16,829 22.39K/s 19:13:55 (22.39 KB/s) - `www.yahoo.com/index.html' saved [16829] FINISHED --19:13:55-- Downloaded: 16,829 bytes in 1 files Guillaume, exactly how have you reproduced the problem?
Re: New suggestion.
Ivan Buttinoni [EMAIL PROTECTED] writes: Again I send a suggestion, this time quite easy. I hope it's not already implemented, else I'm sorry in advance. It would be nice if wget could use a regexp to evaluate what to accept/refuse to download. The regexp has to work on the whole URL and/or filename and/or hostname and/or CGI argument. Sometimes I find the apache directory sorting links useless, e.g.: .../?N=A .../?M=D Here follows a hypothetical invocation for the above example: wget -r -l0 --reg-exclude '[A-Z]=[AD]$' http:// The problem with regexps is that their use would make Wget dependent on a regexp library. To make matters worse, regexp libraries come in all shapes and sizes, with incompatible APIs and implementing incompatible dialects of regexps. I'm staying away from regexps as long as I possibly can.
Re: Debian bug 65791 - when converting links no effort is made tohandle the '?' character
Guillaume Morin [EMAIL PROTECTED] writes: For example if a link to the URL /foo?bar is seen then the correct file is downloaded and saved with the name foo?bar. When viewing the pages with Netscape the '?' character is seen to separate the URL and the arguments. This makes the link fail. That's a known problem. The easy fix of changing ? to %3f didn't work because some browsers still fail to load the file. This problem will be fixed in a future release by changing ? to another, different character.
Re: CSS @import, NetBSD 1.5.2 ok
Martin Tsachev [EMAIL PROTECTED] writes: it compiles on i386-unknown-netbsdelf1.5.2 without any modifications I think that wget isn't parsing the @import CSS declaration, it should save those files when run with -p and convert the links if set so That is true. Parsing @import would require an (easy) change to the HTML parser and a CSS parser. No one has stepped up to implement those yet.
Re: OK, time to moderate this list
Maciej W. Rozycki [EMAIL PROTECTED] writes: On Mon, 8 Apr 2002, Hrvoje Niksic wrote: I was also thinking about checking for `Wget' in the body, and things like that. That might be annoying (although it is certainly an option to consider anyway) as someone sending a mail legitimately may assume the matter being obvious from the list's name and definition and not repeat the program's name anywhere in the headers or the body (only a version number for example, or current when referring to a CVS snapshot). Just like I did here. ;-) Don't get me wrong, a message detected as spam wouldn't get discarded; it would simply need to be approved by an editor. So even the false positives would make it to the list, only a tad later.
Re: -nv option; printing out infos via stderr[http://bugs.debian.org/141323]
Ian Abbott [EMAIL PROTECTED] writes: On 5 Apr 2002 at 18:17, Noel Koethe wrote: Will this be changed so the user could use -nv with /dev/null and get only errors or warnings displayed? So what I think you want is for any log message tagged as LOG_VERBOSE (verbose information) or LOG_NONVERBOSE (basic information) in the source to go to stdout when no log file has been specified and the `-O -' option has not been used and for everything else to go to stderr? That change sounds dangerous. Current Wget output doesn't really have a concept of errors that would be really separate from other output; it only operates on the level of verbosity. This was, of course, a bad design decision, and I agree that steps need to be taken to change it. I'm just not sure that this is the right step. For one, I don't know of any utility that splits its output this way. It is true that many programs print their output on stdout and errors to stderr, but Wget's log output is hardly the actual, programmatic, output of the program. That can only be the result of `-O -'. Suddenly `wget -o X' is no longer equivalent to `wget 2>X', which violates the Principle of Least Surprise.
Re: Satellite [NetGain 2000] [Corruption]
Justin Piszcz [EMAIL PROTECTED] writes: --12:12:21-- ftp://war:*password*@0.0.0.0:21//iso/file.iso = `iso/file.iso' == CWD not required. == PASV ... done.== RETR file.iso ... done. Length: 737,402,880 24% [] 180,231,952 37.40K/s ETA 4:02:27 13:31:51 (37.40 KB/s) - Data connection: Connection timed out; Data transfer aborted. Retrying. This causes corruption in the file. I need to try a client which supports rollback I guess. It seems so. But I really find it strange that there would be an FTP server that fails in this scenario. I've only heard of such proxy failures, and that's not applicable to your case.
Re: Malformed status line error
Torsten Fellhauer -iXpoint- #429 [EMAIL PROTECTED] writes: when connecting to a FTP-Server using a TrendMicro Viruswall Proxy, we get the error Malformed status line, Unfortunately, Wget is right; that status line is quite different from what HTTP mandates. The status line should be something like: HTTP/1.0 200 Ok Or, more generally: HTTP/1.x <status> <message> Instead, the TrendMicro Viruswall Proxy returns: 220 InterScan Version 3.6-Build_1166 $Date: 04/24/2001 22:13:0052$ (mucint01, dynamic, get: N, put: N): Ready That is so far from HTTP that even if Wget's parser were lenient it still wouldn't make sense out of it. Is 220 an HTTP status code? If so, which one? What version of HTTP is the proxy speaking? Someone should write to the makers of TrendMicro Viruswall Proxy and ask them to fix this bug.
Re: GNU Wget 1.5.3
Matthias Jim Knopf [EMAIL PROTECTED] writes: there is a bug (or a feature...) in the version 1.5.3 Note that the latest version of Wget is 1.8.1. I suggest you upgrade because the new version handles URLs much better. I discovered that every doubled slash (//) is converted to a single slash (/) which might make sense for real file-paths, but which does not allow me to retrieve an url like This bug still exists in the latest version, but I plan to fix it before the next release. http://my.server/some/file?get_url=http://foo.bar/ ^^ will be converted to http:/foo.bar which is not a valid url this also does not work if the value for 'get_url' is url-encoded as it should be. 1.8.1 handles this correctly when quoted. For example: $ wget -d 'http://fly.srk.fer.hr/?foo=http%2f%2fbar/baz' DEBUG output created by Wget 1.8.1 on linux-gnu. [...] ---request begin--- GET /?foo=http%2f%2fbar/baz HTTP/1.0
Current download speed in progress bar
Since I implemented the progress bar, I've progressively become more and more annoyed by the fact that the download speed it reports is the average download speed. What I'm usually much more interested in is the current download speed. This patch implements this change; the current download speed is calculated as the speed of the most recent 30 network reads. I think this makes sense -- for very slow downloads, you'll get the average spanning several seconds; for the fast ones, you'll get the average in this fraction of a second. This is what I want -- I think. The one remaining problem is the ETA. Based on the current speed, it changes value wildly. Of course, over time it is generally decreasing, but one can hardly follow it. I removed the flashing by making sure that it's not shown more than once per second, but this didn't fix the problem of unreliable values. Should we revert to the average speed for ETA, or is there a smarter way to handle it? What are other downloaders doing? 2002-04-09 Hrvoje Niksic [EMAIL PROTECTED] * progress.c (bar_update): Maintain an array of the time it took to perform previous 30 network reads. (create_image): Calculate the download speed and ETA based on the last 30 reads, not the entire download. (create_image): Make sure that the ETA is not changed more than once per second. Index: src/progress.c === RCS file: /pack/anoncvs/wget/src/progress.c,v retrieving revision 1.23 diff -u -r1.23 progress.c --- src/progress.c 2001/12/10 05:31:45 1.23 +++ src/progress.c 2002/04/09 18:49:45 @@ -401,6 +401,9 @@ create_image will overflow the buffer. */ #define MINIMUM_SCREEN_WIDTH 45 +/* Number of recent packets we keep the stats for. */ +#define RECENT_ARRAY_SIZE 30 + static int screen_width = DEFAULT_SCREEN_WIDTH; struct bar_progress { @@ -410,7 +413,7 @@ download finishes */ long count; /* bytes downloaded so far */ - long last_update; /* time of the last screen update. */ + long last_screen_update; /* time of the last screen update. 
*/ int width; /* screen width we're using at the time the progress gauge was @@ -420,7 +423,27 @@ signal. */ char *buffer;/* buffer where the bar image is stored. */ - int tick; + int tick;/* counter used for drawing the + progress bar where the total size + is not known. */ + + /* The following variables (kept in a struct for namespace reasons) + keep track of how long it took to read recent packets. See + bar_update() for explanation. */ + struct { +long previous_time; +long times[RECENT_ARRAY_SIZE]; +long bytes[RECENT_ARRAY_SIZE]; +int count; +long summed_times; +long summed_bytes; + } recent; + + /* create_image() uses these to make sure that ETA information + doesn't flash. */ + long last_eta_time; /* time of the last update to download + speed and ETA. */ + long last_eta_value; }; static void create_image PARAMS ((struct bar_progress *, long)); @@ -453,7 +476,8 @@ bar_update (void *progress, long howmuch, long dltime) { struct bar_progress *bp = progress; - int force_update = 0; + int force_screen_update = 0; + int rec_index; bp->count += howmuch; if (bp->total_length > 0 @@ -465,21 +489,75 @@ equal to the expected size doesn't abort. */ bp->total_length = bp->count + bp->initial_length; + /* The progress bar is supposed to display the current download + speed. The first version of the progress bar calculated it by + dividing the total amount of data with the total time needed to + download it. The problem with this was that a stalled or suspended + download could unduly influence the current time. Taking just + the time needed to download the current packet would not work + either because packets arrive too fast and the variations would be + too jerky. + + It would be preferable to show the speed that pertains to a + recent period, say over the past several seconds. But to do this + accurately, we would have to record all the packets received + during the last five seconds. + + What we do instead is maintain a history of a fixed number of + packets. 
It actually makes sense if you think about it -- faster + downloads will have a faster response to speed changes. */ + + rec_index = bp->recent.count % RECENT_ARRAY_SIZE; + ++bp->recent.count; + + /* Instead of calculating the sum of times[] and bytes[], we + maintain the summed quantities. To maintain each sum, we must + make sure that it gets increased
Re: Current download speed in progress bar
Maurice Cinquini [EMAIL PROTECTED] writes: I don't think using only a fraction of a second is a reliable method for estimating current bandwidth. Here are some factors that can make for wildly varying ETAs when just looking at the last fraction of a second. - TCP slow start. - Kernel level buffering - Other network traffic That's beside the point; this was never intended to be a scientific method of determining bandwidth. All I aimed for was something more useful than dividing total bytes with total time. And for bandwidth, I'm confident that my current method is better than what was previously in place. I'm not so sure about ETA, though. I don't like apt's method of calculating the CPS only every N seconds, because -- if I'm reading it right -- it means that you see the same value for 6 seconds, and then have to wait another 6 seconds for a refresh. That sucks. `links', for example, offers both average and current speed, and the latter seems to be updated pretty swiftly. Still, thanks for the suggestions. Unless I find a really cool different suggestion, I'll fall back to the previous method for ETA.
Re: Current download speed in progress bar
Daniel Stenberg [EMAIL PROTECTED] writes: On Tue, 9 Apr 2002, Hrvoje Niksic wrote: Should we revert to the average speed for ETA, or is there a smarter way to handle it? What are other downloaders doing? I'll grab the other part and explain what curl does. It shows a current speed based on the past five seconds, Does it mean that the speed doesn't change for five seconds, or that you always show the *current* speed, but relative to the last five seconds? I may be missing something, but I don't see how to efficiently implement the latter.
Re: Current download speed in progress bar
Tony Lewis [EMAIL PROTECTED] writes: I'm often annoyed by ETA estimates that make no sense. How about showing two values -- something like:
  ETA at average speed: 1:05:17
  ETA at current speed: 15:05
The problem is that Wget is limited by what fits on one line. I'd like to keep enough space for the progress bar, which leaves no room for additional information.
Re: Current download speed in progress bar
Tony Lewis [EMAIL PROTECTED] writes: Could you keep an array of speeds that is updated once a second, such that the value from six seconds ago is discarded and the value for the second that just ended is recorded? Right now I'm doing that kind of trick, but for the last N reads from the network. This translates to a larger interval for slower downloads, and the other way around, which is, I think, what one would want.
Re: Current download speed in progress bar
Andre Majorel [EMAIL PROTECTED] writes: I find it very annoying when a downloader plays yoyo with the remaining time. IMHO, remaining time is by nature a long-term thing, and short-term jitter should not cause it to go up and down. Agreed wholeheartedly, but how would you *implement* a non-jittering ETA? Do you think it makes sense the way 1.8.1 does it, i.e. to calculate the ETA from the average speed?
Re: Current download speed in progress bar
Daniel Stenberg [EMAIL PROTECTED] writes: The meter is updated maximum once per second, I don't think it makes sense to update the screen faster than that. Maybe not, but I sort of like it. Wget's progress bar refreshes the screen (not more than) five times per second, and I like the idea of refreshing the download speed along with the amount. However, I've added code to limit the ETA change to once per second. I've come up with a scheme similar to the one you are describing, except I use smaller subintervals. In other words, at compile time you can independently choose how far into the past you're looking, and into how many chunks that span is divided. I've defaulted these to 3 seconds and 30 intervals, respectively. This basically explains what curl does, not saying it is any particularly scientific way or anything; I've just found this info interesting. Thanks for the info; I appreciate it.
Re: Current download speed in progress bar
Roger L. Beeman [EMAIL PROTECTED] writes: On Wed, 10 Apr 2002, Hrvoje Niksic wrote: Agreed wholeheartedly, but how would you *implement* a non-jittering ETA? Do you think it makes sense the way 1.8.1 does it, i.e. to calculate the ETA from the average speed? One common programming technique is the exponential decay model. Sounds cool. Do you have pseudocode or, failing that, a reference easy enough that even a programmer of Unix command-line utilities can follow it? :-) (I must admit that your email address adds a certain weight to whatever you have to say about measuring bandwidth.) I believe that the method is chosen for its simplicity and that justifications of its validity are completely after the fact. The simplicity is that one keeps a previously calculated value, averages that value with the current measurement, and saves the result for the next iteration, i.e. add and shift right. I thought about calculating the average of the average and the current speed, and using that for ETA, but it sounded too arbitrary and I didn't have time to gather empirical evidence that it was any better than just using the average. Again, I'd be grateful if you could provide some code. You must choose how to normalize the measurement based on irregularity in the measurement interval, however. I'm afraid I can't parse this without understanding the algorithm.
Re: Debian bug 88176 - timestamping is wrong with -O
Unfortunately, this bug is not easy to fix. The problem is that `-O' was originally invented for streaming, i.e. for `-O -'. As a result, many places in Wget's code assume they can freely operate on the file names, and `-O' seems more like an afterthought. On the other hand, many people (reasonably) expect `-O x' to simply override the file name from whatever was specified in the URL to `x'. But the code doesn't work that way. I plan to change the handling of file names to make this work, but that will take some time. Unless someone takes the time to fix this in the existing code base, the bug will remain open until said reorganization. Until then, the workaround is to avoid the `-O -N' combination.
Re: Debian bug 106391 - documentation doesn't warn about passwordsin urls
[ Cc'ing to [EMAIL PROTECTED], as requested by Guillaume. ] Guillaume Morin [EMAIL PROTECTED] writes: this is from the advanced usage section of examples (info docs): * If you want to encode your own username and password to HTTP or FTP, use the appropriate URL syntax (*note URL Format::). wget ftp://hniksic:[EMAIL PROTECTED]/.emacs this would let other users on the system see your password using ps. it should have a big disclaimer. You're right. I'll apply this patch, which I think should add enough warnings to educate the unwary.

2002-04-10  Hrvoje Niksic  [EMAIL PROTECTED]

	* wget.texi: Warn about the dangers of specifying passwords on the
	command line and in unencrypted files.

Index: doc/wget.texi
===================================================================
RCS file: /pack/anoncvs/wget/doc/wget.texi,v
retrieving revision 1.62
diff -u -r1.62 wget.texi
--- doc/wget.texi	2001/12/16 18:05:34	1.62
+++ doc/wget.texi	2002/04/10 21:40:32
@@ -285,6 +285,13 @@
 @file{.netrc} file in your home directory, password will also be
 searched for there.}

+@strong{Important Note}: if you specify a password-containing @sc{url}
+on the command line, the username and password will be plainly visible
+to all users on the system, by way of @code{ps}.  On multi-user systems,
+this is a big security risk.  To work around it, use @code{wget -i -}
+and feed the @sc{url}s to Wget's standard input, each on a separate
+line, terminated by @kbd{C-d}.
+
 You can encode unsafe characters in a @sc{url} as @samp{%xy}, @code{xy}
 being the hexadecimal representation of the character's @sc{ascii}
 value.  Some common unsafe characters include @samp{%} (quoted as
@@ -849,8 +856,15 @@
 @code{digest} authentication scheme.

 Another way to specify username and password is in the @sc{url} itself
-(@pxref{URL Format}).  For more information about security issues with
-Wget, @xref{Security Considerations}.
+(@pxref{URL Format}).  Either method reveals your password to anyone who
+bothers to run @code{ps}.  To prevent the passwords from being seen,
+store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
+those files from other users with @code{chmod}.  If the passwords are
+really important, do not leave them lying in those files either---edit
+the files and delete them after Wget has started the download.
+
+For more information about security issues with Wget, @xref{Security
+Considerations}.

 @cindex proxy
 @cindex cache
@@ -975,6 +989,9 @@
 authentication on a proxy server.  Wget will encode them using the
 @code{basic} authentication scheme.

+Security considerations similar to those with @samp{--http-passwd}
+pertain here as well.
+
 @cindex http referer
 @cindex referer, http
 @item --referer=@var{url}
@@ -2409,6 +2426,10 @@
 wget ftp://hniksic:mypassword@@unix.server.com/.emacs
 @end example

+Note, however, that this usage is not advisable on multi-user systems
+because it reveals your password to anyone who looks at the output of
+@code{ps}.
+
 @cindex redirecting output
 @item
 You would like the output documents to go to standard output instead of
@@ -2773,10 +2794,12 @@
 main issues, and some solutions.

 @enumerate
-@item
-The passwords on the command line are visible using @code{ps}.  If this
-is a problem, avoid putting passwords from the command line---e.g. you
-can use @file{.netrc} for this.
+@item The passwords on the command line are visible using @code{ps}.
+The best way around it is to use @code{wget -i -} and feed the @sc{url}s
+to Wget's standard input, each on a separate line, terminated by
+@kbd{C-d}.  Another workaround is to use @file{.netrc} to store
+passwords; however, storing unencrypted passwords is also considered a
+security risk.

 @item
 Using the insecure @dfn{basic} authentication scheme, unencrypted
Re: Debian bug 131851 - cwd during ftp causes download to fail
Guillaume Morin [EMAIL PROTECTED] writes: When getting a file in a non-root directory from FTP with wget, wget always tries to CWD to that directory before getting the file. Unfortunately, sometimes you're not allowed to CWD to a directory, but you're still allowed to list or download files from it (given that you know the filename). I believe this breaks RFC 959. I think this is quite rare, so I don't plan to add this to Wget in the near future. If someone implements it cleanly, the functionality can go in.
Re: Debian wishlist bug 21148 - wget doesn't allow selectivitybased on mime type
I believe this is already on the todo list. However, this is made harder by the fact that, to implement this kind of rejection, you have to start downloading the file. This is very different from filename-based rejection, where the decision can be made at a very early point in the download process.
Re: spaces and other special caracters in directories
Loic Le Loarer [EMAIL PROTECTED] writes: When I fetch a whole subtree with wget and the directories contain a space or some other special character, these characters are url-encoded in the local version, while this is not the case for files. For example, if I mirror with `wget -m' the directory `to to', which contains the file `to to', I locally get the directory `to%20to' and the file `to to'. Is there an option to get the directory `to to'? The inconsistency you're seeing is a bug, but the intended behavior goes in rather the opposite direction. The code was supposed to url-encode *both* the file and the directory, without an option to suppress it. I will try to fix this for the next release, preferably by uncoupling the url-encoding from the protection of file names from invalid characters. Ideally, the latter would be configurable.
Re: feature wish: switch to disable robots.txt usage
Noel Koethe [EMAIL PROTECTED] writes: Ok, got it. But is it possible to use this option as a switch on the command line? Yes, like this: wget -erobots=off ...
Re: ftp passwords
Antonis Sidiropoulos [EMAIL PROTECTED] writes: But when the password contains characters such as '^' or space, these chars are converted to the form %{hex code}; e.g. a passwd like `^12 34' is translated to `%5E12%2034', so the login fails. Is this a bug? Thanks for the report. It is indeed a bug, and this patch should fix it:

Index: src/url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.68
retrieving revision 1.69
diff -u -r1.68 -r1.69
--- src/url.c	2002/01/14 01:56:40	1.68
+++ src/url.c	2002/01/14 13:26:16	1.69
@@ -528,6 +528,11 @@
   memcpy (*user, str, len);
   (*user)[len] = '\0';

+  if (*user)
+    decode_string (*user);
+  if (*passwd)
+    decode_string (*passwd);
+
   return 1;
 }
Re: wget-1.8.1: build problems, and some patches
Nelson H. F. Beebe [EMAIL PROTECTED] writes: The wget-1.8.1 release is evidently intended to be buildable with old-style K&R compilers, since it automatically detects this and filters the source code with ansi2knr. Unfortunately, there are some syntactical things in the wget source code that ansi2knr cannot recognize, and they prevent a successful build with such a compiler (in my case, cc on HP-UX 10.01, with no c89 or gcc available on that system). Thanks a lot for looking into this. I will apply your patch with one modification: you don't need to conditionalize the use of parameters in declarations -- just wrap them in the PARAMS macro. This is important because it decreases the number of ifdefs in the code. It is good to have someone test the ansi2knr feature, since I no longer have access to systems with pre-ANSI compilers. Here are patches that I applied to get a successful build [I dealt with the log.c problem by manually inserting #undef HAVE_STDARG_H in config.h, to force it to use the old-style varargs interface.] I wonder why this was needed? Does your system have stdarg.h and yet not support the ANSI interface? The patch I am about to apply looks like this:

Index: src/ChangeLog
===================================================================
RCS file: /pack/anoncvs/wget/src/ChangeLog,v
retrieving revision 1.373
diff -u -r1.373 ChangeLog
--- src/ChangeLog	2002/04/11 15:25:50	1.373
+++ src/ChangeLog	2002/04/11 17:06:02
@@ -1,5 +1,29 @@
 2002-04-11  Hrvoje Niksic  [EMAIL PROTECTED]

+	* progress.c (struct progress_implementation): Use PARAMS when
+	declaring the parameters of *create, *update, *finish, and
+	*set_params.
+
+	* netrc.c: Ditto.
+
+	* http.c: Reformat some function definitions so that ansi2knr can
+	read them.
+
+	* hash.c (struct hash_table): Use the PARAMS macro around
+	parameters in the declaration of hash_function and test_function.
+	(prime_size): Spell 2580823717UL and 3355070839UL as (unsigned
+	long)0x99d43ea5 and (unsigned long)0xc7fa5177 respectively, so
+	that pre-ANSI compilers can read them.
+	(find_mapping): Use PARAMS when declaring EQUALS.
+	(hash_table_put): Ditto.
+
+	* ftp.h: Wrap the parameters of ftp_index declaration in PARAMS.
+
+	* cookies.c (cookie_new): Use (unsigned long)0 instead of 0UL,
+	which was unsupported by pre-ANSI compilers.
+
+2002-04-11  Hrvoje Niksic  [EMAIL PROTECTED]
+
 	* url.c (url_filename): Use compose_file_name regardless of
 	whether opt.dirstruct is set.
 	(mkstruct): Don't handle the query and the reencoding of DIR; that
Index: src/cookies.c
===================================================================
RCS file: /pack/anoncvs/wget/src/cookies.c,v
retrieving revision 1.18
diff -u -r1.18 cookies.c
--- src/cookies.c	2001/12/10 02:29:11	1.18
+++ src/cookies.c	2002/04/11 17:06:02
@@ -84,7 +84,7 @@

   /* If we don't know better, assume cookie is non-permanent and valid
      for the entire session.  */
-  cookie->expiry_time = ~0UL;
+  cookie->expiry_time = ~(unsigned long)0;

   /* Assume default port.  */
   cookie->port = 80;
Index: src/ftp.h
===================================================================
RCS file: /pack/anoncvs/wget/src/ftp.h,v
retrieving revision 1.13
diff -u -r1.13 ftp.h
--- src/ftp.h	2002/01/25 03:34:23	1.13
+++ src/ftp.h	2002/04/11 17:06:02
@@ -107,7 +107,7 @@
 struct fileinfo *ftp_parse_ls PARAMS ((const char *, const enum stype));
 uerr_t ftp_loop PARAMS ((struct url *, int *));

-uerr_t ftp_index (const char *, struct url *, struct fileinfo *);
+uerr_t ftp_index PARAMS ((const char *, struct url *, struct fileinfo *));

 char ftp_process_type PARAMS ((const char *));
Index: src/hash.c
===================================================================
RCS file: /pack/anoncvs/wget/src/hash.c,v
retrieving revision 1.14
diff -u -r1.14 hash.c
--- src/hash.c	2001/11/17 18:03:57	1.14
+++ src/hash.c	2002/04/11 17:06:02
@@ -136,8 +136,8 @@
 };

 struct hash_table {
-  unsigned long (*hash_function) (const void *);
-  int (*test_function) (const void *, const void *);
+  unsigned long (*hash_function) PARAMS ((const void *));
+  int (*test_function) PARAMS ((const void *, const void *));

   int size;			/* size of the array */
   int count;			/* number of non-empty, non-deleted
@@ -177,7 +177,8 @@
     10445899, 13579681, 17653589, 22949669, 29834603, 38784989,
     50420551, 65546729, 85210757, 110774011, 144006217, 187208107,
     243370577, 316381771, 411296309, 534685237, 695090819, 903618083,
-    1174703521, 1527114613, 1985248999, 2580823717UL, 3355070839UL
+    1174703521, 1527114613, 1985248999,
+    (unsigned long)0x99d43ea5, (unsigned long)0xc7fa5177
   };
   int i;
   for (i = 0; i < ARRAY_SIZE (primes); i++)
@@ -236,7 +237,7 @@
   struct mapping *mappings = ht
Re: wget-1.8.1: build failure on SGI IRIX 6.5 with c89
Nelson H. F. Beebe [EMAIL PROTECTED] writes:

  c89 -I. -I. -I/opt/include -DHAVE_CONFIG_H \
      -DSYSTEM_WGETRC=\"/usr/local/etc/wgetrc\" \
      -DLOCALEDIR=\"/usr/local/share/locale\" -O -c connect.c
  cc-1164 c89: ERROR File = connect.c, Line = 94
    Argument of type int is incompatible with parameter of type const char *.
      logprintf (LOG_VERBOSE, _("Connecting to %s[%s]:%hu... "),
      ^
  cc-1164 c89: ERROR File = connect.c, Line = 97
    Argument of type int is incompatible with parameter of type const char *.

The argument of type int is probably an indication that the `_' macro is either undefined or expands to an undeclared function. The compiler rightfully assumes the function to return int and complains about the type mismatch. If you check why the macro is misdeclared, you'll likely discover the source of the problem. Inasmuch as this compiler has been excellent in diagnosing violations of the 1989 ISO C Standard, and catching many portability problems, I suspect the error lies in wget. Agreed. But in this case the error is one of configuration, not programming.
Re: -k does not convert form actions
[EMAIL PROTECTED] writes: From the specification, the form "action=" field is a URI and it can be an absolute URL. So it seems it should be fixed up with the -k option, just like hrefs and img srcs are. A good idea, thanks. I've attached a patch, which will be part of the next release, that implements this. Overall it would be very nice if -k were to grab any http://something it sees and convert it if it is on that server, since you also get URLs in JavaScript code that it would be nice to have converted. That's a problem because `-k' sees only the data in tags that are defined to contain URLs. When Wget is taught to rummage through JavaScript looking for URLs, `-k' will become aware of them as well. Here is the patch:

2002-04-11  Hrvoje Niksic  [EMAIL PROTECTED]

	* html-url.c (tag_handle_form): New function.  Pick up form
	actions and mark them for conversion only.

Index: src/html-url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-url.c,v
retrieving revision 1.24
diff -u -r1.24 html-url.c
--- src/html-url.c	2002/02/01 03:34:31	1.24
+++ src/html-url.c	2002/04/11 17:46:52
@@ -48,6 +48,7 @@
 DECLARE_TAG_HANDLER (tag_find_urls);
 DECLARE_TAG_HANDLER (tag_handle_base);
+DECLARE_TAG_HANDLER (tag_handle_form);
 DECLARE_TAG_HANDLER (tag_handle_link);
 DECLARE_TAG_HANDLER (tag_handle_meta);
@@ -73,29 +74,31 @@
   { "embed",	tag_find_urls },
 #define TAG_FIG 7
   { "fig",	tag_find_urls },
-#define TAG_FRAME 8
+#define TAG_FORM 8
+  { "form",	tag_handle_form },
+#define TAG_FRAME 9
   { "frame",	tag_find_urls },
-#define TAG_IFRAME 9
+#define TAG_IFRAME 10
   { "iframe",	tag_find_urls },
-#define TAG_IMG 10
+#define TAG_IMG 11
   { "img",	tag_find_urls },
-#define TAG_INPUT 11
+#define TAG_INPUT 12
   { "input",	tag_find_urls },
-#define TAG_LAYER 12
+#define TAG_LAYER 13
   { "layer",	tag_find_urls },
-#define TAG_LINK 13
+#define TAG_LINK 14
   { "link",	tag_handle_link },
-#define TAG_META 14
+#define TAG_META 15
   { "meta",	tag_handle_meta },
-#define TAG_OVERLAY 15
+#define TAG_OVERLAY 16
   { "overlay",	tag_find_urls },
-#define TAG_SCRIPT 16
+#define TAG_SCRIPT 17
   { "script",	tag_find_urls },
-#define TAG_TABLE 17
+#define TAG_TABLE 18
   { "table",	tag_find_urls },
-#define TAG_TD 18
+#define TAG_TD 19
   { "td",	tag_find_urls },
-#define TAG_TH 19
+#define TAG_TH 20
   { "th",	tag_find_urls }
 };
@@ -141,10 +144,11 @@
    from the information above.  However, some places in the code refer
    to the attributes not mentioned here.  We add them manually.  */
 static const char *additional_attributes[] = {
-  "rel",			/* for TAG_LINK */
-  "http-equiv",			/* for TAG_META */
-  "name",			/* for TAG_META */
-  "content"			/* for TAG_META */
+  "rel",			/* used by tag_handle_link */
+  "http-equiv",			/* used by tag_handle_meta */
+  "name",			/* used by tag_handle_meta */
+  "content",			/* used by tag_handle_meta */
+  "action"			/* used by tag_handle_form */
 };

 static const char **interesting_tags;
@@ -473,6 +477,22 @@
     ctx->base = uri_merge (ctx->parent_base, newbase);
   else
     ctx->base = xstrdup (newbase);
+}
+
+/* Mark the URL found in <form action=...> for conversion.  */
+
+static void
+tag_handle_form (int tagid, struct taginfo *tag, struct map_context *ctx)
+{
+  int attrind;
+  char *action = find_attr (tag, "action", &attrind);
+  if (action)
+    {
+      struct urlpos *action_urlpos = append_one_url (action, 0, tag,
+						     attrind, ctx);
+      if (action_urlpos)
+	action_urlpos->ignore_when_downloading = 1;
+    }
 }

 /* Handle the LINK tag.  It requires special handling because how its
Re: Using wildcards through proxy server
John Poltorak [EMAIL PROTECTED] writes: Can anyone confirm that WGET allows the use of wildcards through a proxy server? It doesn't. Use a substitute: wget -rl1 -A wildcard URL...
Re: wget timeout
Warwick Poole [EMAIL PROTECTED] writes: I want to set a timeout of 5 seconds on a wget http fetch. I have tried -T, --timeout, etc., on the command line and in a .wgetrc file; wget does not seem to obey these directives. You have probably run into the problem that Wget's timeout applies only to reads, not to connection attempts. We plan to fix that for the next release.
Re: Problem with URL
Marcus - Videomoviehouse.com [EMAIL PROTECTED] writes: I am trying to get wget to work with a URL containing characters that it doesn't seem to like. I tried putting the URL in quotes, and it still gave me similar results. It works fine with a simple URL like wget www.something.com/index.html. Any help appreciated. I am running Red Hat Linux. I'm afraid you will need to specify exactly which URL you are having problems with, and what happens, preferably accompanied by a log produced by running Wget with the `-d' flag. Also, please let us know which version of Wget you are using (wget --version).
Re: wget crash
Hack Kampbjørn [EMAIL PROTECTED] writes:

  assertion percentage <= 100 failed: file progress.c, line 552
  zsh: abort (core dumped)  wget -m -c --tries=0 ftp://ftp.scene.org/pub/music/artists/nutcase/mp3/timeofourlives.mp3

progress.c:

  int percentage = (int)(100.0 * size / bp->total_length);
  assert (percentage <= 100);

Of course the assert will fail; size is bigger than total_length! [...] To reproduce with wget-1.8.1:

  $ wget ftp://sunsite.dk/disk1/gnu/wget/wget-1.8{,.1}.tar.gz
  $ cat wget-1.8.tar.gz >> wget-1.8.1.tar.gz
  $ wget -d -c ftp://sunsite.dk/disk1/gnu/wget/wget-1.8.1.tar.gz

Thanks for looking into this. There are two problems here, and most likely two separate bugs. First, I cannot repeat your test case. Maybe sunsite.dk changed their FTP server since Feb 15; anyway, what I get is:

  --> REST 2185627
  350 Restarting at 2185627
  --> RETR wget-1.8.1.tar.gz
  451-Restart offset 2185627 is too large for file size 1097780.
  451 Restart offset reset to 0

Wget (bogusly) considers the 451 response an error in the server response and retries. That's bug number one, but it also means that I cannot repeat your test case. Bug number two is the one the reporter saw. At first I didn't quite understand how it could happen, since bar_update() explicitly guards against such a condition:

  if (bp->total_length > 0
      && bp->count + bp->initial_length > bp->total_length)
    /* We could be downloading more than total_length, e.g. when the
       server sends an incorrect Content-Length header.  In that case,
       adjust bp->total_length to the new reality, so that the code in
       create_image() that depends on total size being smaller or
       equal to the expected size doesn't abort.  */
    bp->total_length = bp->count + bp->initial_length;

The problem is that the same guard is not implemented in bar_create() and bar_finish(), which also call create_image(). In the FTP case, the crash comes from bar_create(). This patch should fix it:

2002-04-11  Hrvoje Niksic  [EMAIL PROTECTED]

	* progress.c (bar_create): If INITIAL is larger than TOTAL, fix
	TOTAL.
	(bar_finish): Likewise.

Index: src/progress.c
===================================================================
RCS file: /pack/anoncvs/wget/src/progress.c,v
retrieving revision 1.27
diff -u -r1.27 progress.c
--- src/progress.c	2002/04/11 17:49:32	1.27
+++ src/progress.c	2002/04/11 18:49:08
@@ -461,6 +461,11 @@

   memset (bp, 0, sizeof (*bp));

+  /* In theory, our callers should take care of this pathological
+     case, but it can sometimes happen. */
+  if (initial > total)
+    total = initial;
+
   bp->initial_length = initial;
   bp->total_length = total;
@@ -493,7 +498,7 @@
       adjust bp->total_length to the new reality, so that the code in
       create_image() that depends on total size being smaller or
       equal to the expected size doesn't abort.  */
-    bp->total_length = bp->count + bp->initial_length;
+    bp->total_length = bp->initial_length + bp->count;

   /* This code attempts to determine the current download speed.  We
      measure the speed over the interval of approximately three
@@ -564,6 +569,11 @@
 bar_finish (void *progress, long dltime)
 {
   struct bar_progress *bp = progress;
+
+  if (bp->total_length > 0
+      && bp->count + bp->initial_length > bp->total_length)
+    /* See bar_update() for explanation. */
+    bp->total_length = bp->initial_length + bp->count;

   create_image (bp, dltime);
   display_image (bp->buffer);
Re: wget 1.8.1 crashes on Solaris (i386 and sparc) v7 and v8, butworks on WinNT
Christopher Scott [EMAIL PROTECTED] writes: The attached file contains a link which causes wget 1.8.1 to crash on Solaris i386 and sparc, on both Solaris 7 and 8 on both platforms. However, I downloaded the latest version for Windows, and it ran correctly!?! I'm afraid I cannot get Wget to dump core downloading this link, either from the command line or from a `lnk.txt' file. Try recompiling Wget with debugging information (make clean; make CFLAGS=-g). When it crashes, run `gdb wget core' and type `where'. Mail the output here. Thanks for the report.
Re: No clobber and .shtml files
This change is fine with me. I vaguely remember that this test is performed in two places; you might want to create a function.
Re: ETA on wget timeout option
Christopher H. Taylor [EMAIL PROTECTED] writes: Any ETA on when you're going to add a timeout alarm to the connect() function? I'm running 1.8.1 and still have the same problem. Many of my applications that utilize wget are time critical and I'm anxiously awaiting this fix. Thanks for your reply. I've just implemented this, and it's been passing my initial tests. I'll apply it to CVS shortly. I can provide the patch, but it's against the latest CVS and is likely not to apply to the 1.8.1 sources. However, if it's critical for you, you can grab the latest CVS sources and use that. I believe the current CVS is at least as stable as 1.8.1.
Re: No clobber and .shtml files
Ian Abbott [EMAIL PROTECTED] writes: On 11 Apr 2002 at 21:00, Hrvoje Niksic wrote: This change is fine with me. I vaguely remember that this test is performed in two places; you might want to create a function. Certainly. Where's the best place for it? utils.c? As good a place as any.
Re: /usr/include/stdio.h:120: previous declaration of `va_list'
Kevin Rodgers [EMAIL PROTECTED] writes: 1. Don't #define _XOPEN_SOURCE 500 (by commenting it out). 2. Do #define _VA_ALIST. I can confirm that (1) works. I didn't try (2). Could you please try (2) and see if it works out? I'm reluctant to withdraw the _XOPEN_SOURCE definition because it's supposed to create the kind of environment that we want -- standards-compliant with useful extensions. Without it, some functions we use just don't get declared. (I think strptime is one of them, but there are probably more.) I'm keeping that option as a last resort. Thanks for the report and the analysis.
Re: Goodbye and good riddance
James C. McMaster (Jim) [EMAIL PROTECTED] writes: This could be a great resource, but (I hate to say this) it has been rendered more trouble than it is worth by the stubbornness and stupidity of the owner. He has turned a deaf ear to all pleas to do something, ANYTHING, to stop the flood of spam, viruses and annoyances posted to the list. Actually, I was planning to work on the spam problem this weekend. (Don't for a moment think I'm not annoyed by it.) It *will* be resolved, hopefully to everyone's satisfaction. But if several spams are enough to deter you from a useful resource and drive you to name-calling targeted at the very person who created it, I cannot honestly feel dismayed by your choice. This is the one and only mailing list that still maintains this policy, This is a factually incorrect statement. I will continue to use it without support, because getting support is more trouble than it is worth. Don't forget that you can always post to the mailing list *without* being subscribed. :-) Who knows, maybe one day you'll reap the benefits of what you are badmouthing right now. I respectfully ask the other participants to extend their patience for some more days. I apologize for not having provided a better solution already. Despite the insults, I do not deny my part of the blame -- it is just your method (of dealing with spam) I disagree with.