Re: Bug report
Gary Reysa wrote:

Hi, I don't really know if this is a Wget bug, or some problem with my website, but, either way, maybe you can help.

I have a web site ( www.BuildItSolar.com ) with perhaps a few hundred pages (260MB of storage total). Someone did a Wget on my site, and managed to log 111,000 hits and 58,000 page views (using more than a GB of bandwidth). I am wondering how this can happen, since the number of page views is about 200 times the number of pages on my site??

Is there something I can do to prevent this? Is there something about the organization of my website that is causing Wget to get stuck in a loop? I've never used Wget, but I am guessing that this guy really did not want 50,000+ pages -- do you provide some way for the user to shut itself down when it reaches some reasonable limit?

My website is non-commercial, and provides a lot of information that people find useful in building renewable energy projects. It generates zero income, and I can't really afford to have a lot of people come in and burn up GBs of bandwidth to no useful end.

Help!

Gary Reysa
Bozeman, MT
[EMAIL PROTECTED]

Hello Gary,

From a quick look at your site, it appears to be mainly static HTML that would not generate a lot of extra crawls. If you have some dynamic portion of your site, like a calendar, that could make wget go into an infinite loop. It would be much easier to tell if you could look at the server logs that show what pages were requested. They would easily tell you what wget was getting hung on.

One problem I did notice is that your site is generating soft 404s. In other words, it is sending back an HTTP 200 response when it should be sending back a 404 response. So if wget tries to access http://www.builditsolar.com/blah your web server is telling wget that the page actually exists. This *could* cause more crawls than necessary, but not likely. This problem should be fixed though.

It's possible the wget user did not know what they were doing and ran the crawler several times. You could try to block traffic from that particular IP address or create a robots.txt file that tells crawlers to stay away from your site or just certain pages. Wget respects robots.txt. For more info: http://www.robotstxt.org/wc/robots.html

Regards,
Frank
Re: Bug report
Juhana Sadeharju [EMAIL PROTECTED] writes:

Command: wgetdir http://liarliar.sourceforge.net;.
Problem: Files are named as
content.php?content.2
content.php?content.3
content.php?content.4
which are interpreted, e.g., by Nautilus as manual pages and are displayed as plain texts. Could the files and the links to them be renamed as the following?
content.php?content.2.html
content.php?content.3.html
content.php?content.4.html

Use the option `--html-extension' (-E).

After all, are those pages still php files or generated html files? If they are html files produced by the php files, then it could be a good idea to add a new extension to the files.

They're the latter -- HTML files produced by the server-side PHP code.

Command: wgetdir http://www.newtek.com/products/lightwave/developer/lscript2.6/index.html;
Problem: Images are not downloaded. Perhaps because the image links are the following:
<image src=v26_2.jpg>

I've never seen this tag, but it seems to be the same as IMG. Mozilla seems to grok it and its DOM inspector thinks it has seen IMG. Is this tag documented anywhere? Does IE understand it too?
Re: bug report and patch, HTTPS recursive get
In message Re: bug report and patch, HTTPS recursive get, Ian Abbott wrote...

Thanks again for the bug report and the proposed patch. I thought some of the scheme tests in recur.c were getting messy, so propose the following patch that uses a function to check for similar schemes.

Thanks for rewriting it. Your patch solved the problem.

Thank you

---
Doumae Kiyotaka
Internet Initiative Japan Inc.
Technical Planning Division
Re: bug report and patch, HTTPS recursive get
On Wed, 15 May 2002 18:44:19 +0900, Kiyotaka Doumae [EMAIL PROTECTED] wrote:

I found a bug in wget with HTTPS recursive get, and propose a patch.

Thanks for the bug report and the proposed patch. The current scheme comparison checks are getting messy, so I'll write a function to check schemes for similarity (when I can spare the time later today).
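A scheme-similarity helper of the kind mentioned above might look roughly like the following. This is only an illustrative sketch, not the patch that was actually posted: the `enum scheme` values and the name `schemes_similar` are hypothetical, and it assumes that HTTP and HTTPS are the pair meant to count as "similar" during recursion.

```c
#include <stdbool.h>

/* Hypothetical scheme enumeration; the names are illustrative and are
   not taken from Wget's source. */
enum scheme { SCHEME_HTTP, SCHEME_HTTPS, SCHEME_FTP };

/* Return true when a link found on a page fetched with scheme `a` may be
   followed to a URL with scheme `b`. Identical schemes always match;
   HTTP and HTTPS are assumed to form one family. */
bool schemes_similar(enum scheme a, enum scheme b)
{
    if (a == b)
        return true;
    return (a == SCHEME_HTTP && b == SCHEME_HTTPS)
        || (a == SCHEME_HTTPS && b == SCHEME_HTTP);
}
```

Putting the comparison in one function means each call site in the recursion code can ask a single question instead of repeating pairs of equality tests.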
Re: Bug report
On Fri, 3 May 2002 18:37:22 +0200, Emmanuel Jeandel [EMAIL PROTECTED] wrote:

ejeandel@yoknapatawpha:~$ wget -r a:b
Segmentation fault

Patient: Doctor, it hurts when I do this
Doctor: Well don't do that then!

Seriously, this is already fixed in CVS.
Re: Bug report: 1) Small error 2) Improvement to Manual
On 17 Jan 2002 at 2:15, Hrvoje Niksic wrote:

Michael Jennings [EMAIL PROTECTED] writes:

WGet returns an error message when the .wgetrc file is terminated with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the command-line language for all versions of Windows, so ignoring the end-of-file mark would make sense.

Ouch, I never thought of that. Wget opens files in binary mode and handles the line termination manually -- but I never thought to handle ^Z.

Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems?

In the Windows C compilers I've tried (Microsoft and Borland ones), "r" causes the file to be opened in text mode by default (there are ways to override that at compile time and/or run time), and this causes the ^Z to be treated as an EOF (there might be ways to override that too).
Re: Bug report: 1) Small error 2) Improvement to Manual
WGet returns an error message when the .wgetrc file is terminated with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the command-line language for all versions of Windows, so ignoring the end-of-file mark would make sense.

Ouch, I never thought of that. Wget opens files in binary mode and handles the line termination manually -- but I never thought to handle ^Z.

Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems?

In the Windows C compilers I've tried (Microsoft and Borland ones), "r" causes the file to be opened in text mode by default (there are ways to override that at compile time and/or run time), and this causes the ^Z to be treated as an EOF (there might be ways to override that too).

I think it has to do with comments, because the definition is that, starting with '#', the rest of the line is ignored. And a line ends with '\n' or the end of the file, not with a special character '\0'. To me that means that aborting the reading of a text file when a zero is found is incorrect parsing.

Cu Thomas Lußnig
Re: Bug report: 1) Small error 2) Improvement to Manual
On 21 Jan 2002 at 14:56, Thomas Lussnig wrote:

Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems?

I think it has to do with comments, because the definition is that, starting with '#', the rest of the line is ignored. And a line ends with '\n' or the end of the file, not with a special character '\0'. To me that means that aborting the reading of a text file when a zero is found is incorrect parsing.

(N.B. the control-Z character would be '\032', not '\0'.)

So maybe just mention in the documentation that the wgetrc file is considered to be a plain text file, whatever that means for the system Wget is running on. Maybe mention peculiarities of DOS/Windows, etc.

In general, it is more portable to read or write native text files in text mode, as it performs whatever local conversions are necessary to make reads and writes of text files appear like UNIX (i.e. each line of text terminated by a newline '\n'). In binary mode, what you get depends on the system (Mac text files have lines terminated by carriage return ('\r') for example, and some systems (VMS?) don't even have line termination characters as such.)

In the case of Wget, log files are already written in text mode. I think wgetrc needs to be read in text mode and that's an easy change.

In the case of the --input-file option, ideally the input file should be read in text mode unless the --force-html option is used, in which case it should be read in the same mode as when parsing other locally-stored HTML files.

Wget stores retrieved files in binary mode but the mode used when reading those locally-stored files is less precise (not that it makes much difference for UNIX). It uses open() (not fopen()) and read() to read those files into memory (or uses mmap() to map them into memory space if supported). The DOS/Windows version of open() allows you to specify text or binary mode, defaulting to text mode, so it looks like the Windows version of Wget saves html files in binary mode and reads them back in in text mode! Well whatever - the HTML parser still seems to work okay on Windows, probably because HTML isn't that fussy about line-endings anyway!

So to support --input-file portably (not the --force-html version), the get_urls_file() function in url.c should probably call a new function read_file_text() (or read_text_file()) instead of read_file() as it does at the moment. For UNIX-type systems, that could just fall back to calling read_file().

The local HTML file parsing stuff should probably be left well alone but possibly add some #ifdef code for Windows to open the file in binary mode, though there may be differences between compilers for that.
RE: Bug report: 1) Small error 2) Improvement to Manual
On 17/01/2002 07:34:05 Herold Heiko wrote:

[proper order restored]

-Original Message-
From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 17, 2002 2:15 AM
To: Michael Jennings
Cc: [EMAIL PROTECTED]
Subject: Re: Bug report: 1) Small error 2) Improvement to Manual

Michael Jennings [EMAIL PROTECTED] writes:

1) There is a very small bug in WGet version 1.8.1. The bug occurs when a .wgetrc file is edited using an MS-DOS text editor: WGet returns an error message when the .wgetrc file is terminated with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the command-line language for all versions of Windows, so ignoring the end-of-file mark would make sense.

Ouch, I never thought of that. Wget opens files in binary mode and handles the line termination manually -- but I never thought to handle ^Z. As much as I'd like to be helpful, I must admit I'm loath to encumber the code with support for this particular thing. I have never seen it before; is it only an artifact of DOS editors, or is it used on Windows too?

[snip copy con file.txt]

However in this case (at least when I just tried) the file won't contain the ^Z. OTOH some DOS programs will still work on NT4, NT2k and XP, and could be used, and would create files ending with ^Z. But do they really belong here, and should wget be bothered?

What we really need to know is: Is ^Z still a valid, recognized character indicating end-of-file (for text-mode files) for command shell programs on Windows NT 4/2k/XP? Could somebody with access to the *windows standards* shed more light on this question?

My personal idea is: As a matter of fact no *windows* text editor I know of, even the supplied windows ones (notepad, wordpad) AFAIK will add the ^Z at the end of file.txt. Wget is a *windows* program (although running in console mode), not a *Dos* program (except for the real dos port I know exists but never tried out).

I don't think there's a distinction between DOS and Windows programs in this regard. The C runtime library is most likely to play a significant role here. For a file fopen-ed in "rt" mode, the RTL would convert \r\n -> \n and silently eat the _first_ ^Z, returning EOF at that point. When writing, it goes the other way 'round WRT \n -> \r\n. I'm unsure about whether it writes ^Z at the end, though.

So personally I'd say it would not be really necessary adding support for the ^Z, even in the win32 port; except possibly for the Dos port, if the porter of that beast thinks it would be useful.

Problem could be solved by opening .netrc in "rt". However, the "t" is a non-standard extension.

However, this is not wget's problem IMO. Different editors may behave differently. Example: on OS/2 (which isn't a DOS shell, but can run DOS programs), the system editor (e.exe) *does* append a ^Z at the end of every file it saves. People have patched the binary to remove this feature :-) AFAIK no other OS/2 editor does this.

--
Csaba Ráduly, Software Engineer
Sophos Anti-Virus
email: [EMAIL PROTECTED] http://www.sophos.com
US Support: +1 888 SOPHOS 9
UK Support: +44 1235 559933
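The text-mode behaviour described above (CRLF collapsed to a newline, reading stopped at the first ^Z) can be imitated on an in-memory buffer, which is one way a program that reads files in binary mode could get the same result on any platform. The sketch below is an assumption-laden illustration, not code from Wget: the function name `dos_text_normalize` is made up, and it only models the two conversions mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define CTRL_Z '\032'   /* DOS end-of-file mark (0x1A) */

/* Normalize the first `len` bytes of `buf` in place, the way a
   DOS/Windows text-mode read is described in this thread: stop at the
   first ^Z and turn each "\r\n" pair into "\n". A lone '\r' is kept.
   Returns the new length; the result is not NUL-terminated. */
size_t dos_text_normalize(char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL);
    for (in = 0; in < len; in++) {
        if (buf[in] == CTRL_Z)
            break;                      /* treat ^Z as end of file */
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;                   /* drop the CR of a CRLF pair */
        buf[out++] = buf[in];
    }
    return out;
}
```

Doing the conversion by hand sidesteps the non-standard "t" flag, at the cost of duplicating what the runtime library already does on DOS and Windows.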
Re: Bug report: 1) Small error 2) Improvement to Manual
Herold Heiko [EMAIL PROTECTED] writes:

My personal idea is: As a matter of fact no *windows* text editor I know of, even the supplied windows ones (notepad, wordpad) AFAIK will add the ^Z at the end of file.txt. Wget is a *windows* program (although running in console mode), not a *Dos* program (except for the real dos port I know exists but never tried out). So personally I'd say it would not be really necessary adding support for the ^Z, even in the win32 port;

That was my line of thinking too.
Re: Bug report: 1) Small error 2) Improvement to Manual
Obviously, this is completely your decision. You are right, only DOS editors make the mistake. (It should be noted that DOS is MS Windows' only command line language. It isn't going away; even Microsoft supplies command line utilities with all versions of its OSs. Yes, Windows will probably eventually go away, but not soon.)

However, I have a comment: There is simple logic that would solve this problem. WGet, when it reads a line in the configuration file, probably now strips off trailing spaces (hex 20, decimal 32). I suggest that it strip off both trailing spaces and control characters (characters with hex values of 1F or less, decimal values of 31 or less). This is a simple change that would work in all cases.

Regards,
Michael

__

Hrvoje Niksic wrote:

Herold Heiko [EMAIL PROTECTED] writes:

My personal idea is: As a matter of fact no *windows* text editor I know of, even the supplied windows ones (notepad, wordpad) AFAIK will add the ^Z at the end of file.txt. Wget is a *windows* program (although running in console mode), not a *Dos* program (except for the real dos port I know exists but never tried out). So personally I'd say it would not be really necessary adding support for the ^Z, even in the win32 port;

That was my line of thinking too.
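Michael's suggestion of stripping trailing spaces and control characters from each configuration line could be sketched as below. This is a minimal illustration under stated assumptions, not Wget's actual parsing code: the function name `strip_trailing_ctrl` is hypothetical, and the line is assumed to be a writable, NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove trailing spaces (0x20) and control characters (0x1F or less)
   from a NUL-terminated line, in place. This covers '\n', '\r', tabs
   and the DOS ^Z (0x1A). Returns the new length of the line. */
size_t strip_trailing_ctrl(char *line)
{
    size_t len;

    assert(line != NULL);
    len = strlen(line);
    while (len > 0 && (unsigned char) line[len - 1] <= 0x20)
        len--;
    line[len] = '\0';
    return len;
}
```

Note that this only helps when the ^Z is the last thing on the line; a ^Z followed by further text would still need the end-of-file treatment discussed elsewhere in the thread.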
RE: Bug report: 1) Small error 2) Improvement to Manual
From: Michael Jennings [mailto:[EMAIL PROTECTED]]

Obviously, this is completely your decision. You are right, only DOS editors make the mistake. (It should be noted that DOS is MS Windows' only command line language. It isn't going away; even Microsoft supplies command line utilities with all versions of its OSs. Yes, Windows will probably eventually go

Please note the difference: all windows versions include a command line. However, that command line AFAIK is not dos - it is able to run dos programs, either because it is based on dos (win 9x) or because it is capable of understanding the difference between w32 command line programs and dos programs, and of starting the necessary dos *emulation*. But it is not dos, and the behaviour is not like dos.

As far as I know, windows command line programs do not use ^Z as end-of-file terminators (although some do honour it for emulation/compatibility), only real dos programs do (does anybody know if there is a - MS - standard for this?).

If this is true, should wget on windows really emulate the behaviour of dos programs, of an environment windows was originally based on but where it is *not*running*anymore* (wget I mean)? From a purist's point of view, no. From an end-user point of view, possibly, in order to facilitate the changeover.

On the other hand, your report is the first one I ever saw. Considering Hrvoje's reaction and the lack of support in the original windows port, I'd say this is not a problem generally felt as important, so personally I'm in favor of not cluttering up the port any more with special behaviour. But it is Hrvoje's decision, as always. If you feel it is important, write a patch and submit it; it shouldn't be a major piece of work.

Heiko

--
-- PREVINET S.p.A. [EMAIL PROTECTED]
-- Via Ferretto, 1 ph x39-041-5907073
-- I-31021 Mogliano V.to (TV) fax x39-041-5907087
-- ITALY
Re: Bug report
Pavel Stepchenko [EMAIL PROTECTED] writes:

Hello bug-wget,

$ wget --version
GNU Wget 1.8
$ wget ftp://password:[EMAIL PROTECTED]:12345/Dir%20One/This.Is.Long.Name.Of.The.Directory/*
Warning: wildcards not supported in HTTP.

Oooops! But this is FTP url, not HTTP!

Are you using a proxy?