Re: Bug report

2006-04-01 Thread Frank McCown

Gary Reysa wrote:

Hi,

I don't really know if this is a Wget bug, or some problem with my 
website, but, either way, maybe you can help.


I have a web site ( www.BuildItSolar.com ) with perhaps a few hundred 
pages (260MB of storage total).  Someone did a Wget on my site, and 
managed to log 111,000 hits and 58,000 page views (using more than a GB 
of bandwidth).


I am wondering how this can happen, since the number of page views is 
about 200 times the number of pages on my site??


Is there something I can do to prevent this?  Is there something about 
the organization of my website that is causing Wget to get stuck in a loop?


I've never used Wget, but I am guessing that this guy really did not 
want 50,000+ pages -- do you provide some way for the user to shut it 
down when it reaches some reasonable limit?


My website is non-commercial, and provides a lot of information that 
people find useful in building renewable energy projects.  It generates 
zero income, and I can't really afford to have a lot of people come in 
and burn up GBs of bandwidth to no useful end.  Help!


Gary Reysa


Bozeman, MT
[EMAIL PROTECTED]



Hello Gary,

From a quick look at your site, it appears to be mainly static HTML 
that would not generate a lot of extra crawls.  If you have some dynamic 
portion of your site, like a calendar, that could make wget go into an 
infinite loop.  It would be much easier to tell if you could look at the 
server logs that show which pages were requested.  They would easily tell 
you what wget was getting hung on.
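
For example, if your host keeps Apache-style access logs (the log path 
and the IP address below are only placeholders), a one-liner like this 
would show which URLs that client kept requesting:

  grep "10.0.0.1" /var/log/apache/access_log \
    | awk '{print $7}' | sort | uniq -c | sort -rn | head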


One problem I did notice is that your site is generating soft 404s. 
In other words, it is sending back an HTTP 200 response when it should be 
sending back a 404 response.  So if wget tries to access


http://www.builditsolar.com/blah

your web server is telling wget that the page actually exists.  This 
*could* cause more crawls than necessary, but it's not likely.  This 
problem should be fixed though.
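
(You can check this yourself with wget; asking for a page that doesn't 
exist should come back with a 404 status, not 200 -- the URL below is 
just a made-up example.  -S prints the server's response headers and 
--spider avoids saving anything.)

  wget -S --spider http://www.builditsolar.com/no-such-page.htm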


It's possible the wget user did not know what they were doing and ran 
the crawler several times.  You could try to block traffic from that 
particular IP address or create a robots.txt file that tells crawlers to 
stay away from your site or just certain pages.  Wget respects 
robots.txt.  For more info:


http://www.robotstxt.org/wc/robots.html
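
For instance, a robots.txt placed at the top level of the site (the 
/calendar/ path is only a placeholder for whatever section you want to 
protect) might look like:

  # keep well-behaved robots out of the dynamic part of the site
  User-agent: *
  Disallow: /calendar/

Changing the Disallow line to a single / would shut robots out of the 
whole site.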

Regards,
Frank



Re: Bug report

2004-03-24 Thread Hrvoje Niksic
Juhana Sadeharju [EMAIL PROTECTED] writes:

 Command: wgetdir "http://liarliar.sourceforge.net".
 Problem: Files are named as
   content.php?content.2
   content.php?content.3
   content.php?content.4
 which are interpreted, e.g., by Nautilus as manual pages and are
 displayed as plain text. Could the files and the links to them be
 renamed as follows?
   content.php?content.2.html
   content.php?content.3.html
   content.php?content.4.html

Use the option `--html-extension' (-E).
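
For example (the URL is the one from your report):

  wget -r -E "http://liarliar.sourceforge.net/"

With -E, anything the server delivers as text/html gets a .html suffix 
appended to the local file name, so content.php?content.2 would be saved 
as content.php?content.2.html.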

 After all, are those pages still php files or generated html files?
 If they are html files produced by the php files, then it could be a
 good idea to add a new extension to the files.

They're the latter -- HTML files produced by the server-side PHP code.

 Command: wgetdir 
 "http://www.newtek.com/products/lightwave/developer/lscript2.6/index.html".
 Problem: Images are not downloaded. Perhaps because the image links
 are the following:
   <image src="v26_2.jpg">

I've never seen this tag, but it seems to be the same as IMG.  Mozilla
seems to grok it and its DOM inspector thinks it has seen IMG.  Is
this tag documented anywhere?  Does IE understand it too?



Re: bug report and patch, HTTPS recursive get

2002-05-17 Thread Kiyotaka Doumae


In message "Re: bug report and patch, HTTPS recursive get",
Ian Abbott wrote...
 Thanks again for the bug report and the proposed patch.  I thought some
 of the scheme tests in recur.c were getting messy, so propose the
 following patch that uses a function to check for similar schemes.

Thanks for the rewrite. Your patch solved the problem.

Thank you

---
Doumae Kiyotaka
Internet Initiative Japan Inc.
Technical Planning Division



Re: bug report and patch, HTTPS recursive get

2002-05-15 Thread Ian Abbott

On Wed, 15 May 2002 18:44:19 +0900, Kiyotaka Doumae [EMAIL PROTECTED]
wrote:

I found a bug in wget with HTTPS recursive get, and propose
a patch.

Thanks for the bug report and the proposed patch.  The current scheme
comparison checks are getting messy, so I'll write a function to check
schemes for similarity (when I can spare the time later today).
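
The idea is something like this (only a sketch to show the shape, not 
the actual patch -- the names here are placeholders and the real enum 
lives in wget's url.h):

  /* Sketch only: treat http and https as "similar" so a recursive
     https download may follow http links and vice versa. */
  enum url_scheme { SCHEME_HTTP, SCHEME_HTTPS, SCHEME_FTP, SCHEME_INVALID };

  static int
  schemes_alike (enum url_scheme a, enum url_scheme b)
  {
    if (a == b)
      return 1;
    if ((a == SCHEME_HTTP && b == SCHEME_HTTPS)
        || (a == SCHEME_HTTPS && b == SCHEME_HTTP))
      return 1;
    return 0;
  }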



Re: Bug report

2002-05-04 Thread Ian Abbott

On Fri, 3 May 2002 18:37:22 +0200, Emmanuel Jeandel
[EMAIL PROTECTED] wrote:

ejeandel@yoknapatawpha:~$ wget -r a:b
Segmentation fault

Patient: Doctor, it hurts when I do this
Doctor: Well don't do that then!

Seriously, this is already fixed in CVS.



Re: Bug report: 1) Small error 2) Improvement to Manual

2002-01-21 Thread Ian Abbott

On 17 Jan 2002 at 2:15, Hrvoje Niksic wrote:

 Michael Jennings [EMAIL PROTECTED] writes:
  WGet returns an error message when the .wgetrc file is terminated
  with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the
  command-line language for all versions of Windows, so ignoring the
  end-of-file mark would make sense.
 
 Ouch, I never thought of that.  Wget opens files in binary mode and
 handles the line termination manually -- but I never thought to handle
 ^Z.

Why not just open the wgetrc file in text mode using
fopen(name, "r") instead of "rb"? Does that introduce other
problems?

In the Windows C compilers I've tried (Microsoft and Borland ones),
"r" causes the file to be opened in text mode by default (there are
ways to override that at compile time and/or run time), and this
causes the ^Z to be treated as an EOF (there might be ways to
override that too).
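
A small test program makes the difference visible (the behaviour 
described in the comment is what the DOS/Windows runtimes above do; the 
file name is arbitrary):

  #include <stdio.h>

  /* Read test.txt twice.  With the DOS/Windows C runtimes discussed
     here, the text-mode pass stops at a ^Z (0x1A) and folds \r\n to
     \n, while the binary-mode pass returns every byte as stored. */
  static long
  count_bytes (const char *mode)
  {
    FILE *fp = fopen ("test.txt", mode);
    long n = 0;
    if (!fp)
      return -1;
    while (fgetc (fp) != EOF)
      n++;
    fclose (fp);
    return n;
  }

  int
  main (void)
  {
    printf ("text mode (\"r\"):    %ld bytes\n", count_bytes ("r"));
    printf ("binary mode (\"rb\"): %ld bytes\n", count_bytes ("rb"));
    return 0;
  }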



Re: Bug report: 1) Small error 2) Improvement to Manual

2002-01-21 Thread Thomas Lussnig



WGet returns an error message when the .wgetrc file is terminated
with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the
command-line language for all versions of Windows, so ignoring the
end-of-file mark would make sense.

Ouch, I never thought of that.  Wget opens files in binary mode and
handles the line termination manually -- but I never thought to handle
^Z.


Why not just open the wgetrc file in text mode using
fopen(name, "r") instead of "rb"? Does that introduce other
problems?

In the Windows C compilers I've tried (Microsoft and Borland ones),
"r" causes the file to be opened in text mode by default (there are
ways to override that at compile time and/or run time), and this
causes the ^Z to be treated as an EOF (there might be ways to
override that too).

I think it has to do with comments, because the definition is that
everything from '#' to the end of the line is ignored.  A line ends
with '\n' or at the end of the file, not with a special character
like '\0'; to me that means aborting the read of a text file when a
zero is found amounts to incorrect parsing.

Cu Thomas Lußnig






Re: Bug report: 1) Small error 2) Improvement to Manual

2002-01-21 Thread Ian Abbott

On 21 Jan 2002 at 14:56, Thomas Lussnig wrote:

 Why not just open the wgetrc file in text mode using
 fopen(name, "r") instead of "rb"? Does that introduce other
 problems?
 I think it has to do with comments, because the definition is that
 everything from '#' to the end of the line is ignored.  A line ends
 with '\n' or at the end of the file, not with a special character
 like '\0'; to me that means aborting the read of a text file when a
 zero is found amounts to incorrect parsing.

(N.B. the control-Z character would be '\032', not '\0'.)

So maybe just mention in the documentation that the wgetrc file is
considered to be a plain text file, whatever that means for the
system Wget is running on. Maybe mention the peculiarities of
DOS/Windows, etc.

In general, it is more portable to read or write native text files
in text mode, as it performs whatever local conversions are
necessary to make reads and writes of text files appear like UNIX
(i.e. each line of text terminated by a newline '\n'). In binary
mode, what you get depends on the system (Mac text files have lines
terminated by a carriage return ('\r'), for example, and some systems
(VMS?) don't even have line termination characters as such).

In the case of Wget, log files are already written in text mode. I
think wgetrc needs to be read in text mode and that's an easy
change.

In the case of the --input-file option, ideally the input file
should be read in text mode unless the --force-html option is used,
in which case it should be read in the same mode as when parsing
other locally-stored HTML files.

Wget stores retrieved files in binary mode but the mode used when
reading those locally-stored files is less precise (not that it
makes much difference for UNIX). It uses open() (not fopen()) and
read() to read those files into memory (or uses mmap() to map them
into memory space if supported). The DOS/Windows version of open()
allows you to specify text or binary mode, defaulting to text mode,
so it looks like the Windows version of Wget saves html files in
binary mode and reads them back in in text mode! Well whatever -
the HTML parser still seems to work okay on Windows, probably
because HTML isn't that fussy about line-endings anyway!

So to support --input-file portably (not the --force-html version),
the get_urls_file() function in url.c should probably call a new
function read_file_text() (or read_text_file()) instead of
read_file() as it does at the moment. For UNIX-type systems, that
could just fall back to calling read_file().
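
A rough sketch of what read_text_file() could look like on the 
DOS/Windows side (just an illustration -- in Wget it would sit next to 
read_file() and return whatever structure read_file() returns; here it 
simply returns a malloc'd, NUL-terminated buffer):

  #include <stdio.h>
  #include <stdlib.h>

  /* Read NAME in text mode so the C runtime folds \r\n to \n and
     honours ^Z for us; store the length in *LEN if LEN is non-NULL. */
  char *
  read_text_file (const char *name, long *len)
  {
    FILE *fp = fopen (name, "r");
    long size = 512, used = 0;
    char *buf, *tmp;
    int c;

    if (!fp)
      return NULL;
    buf = malloc (size);
    if (!buf)
      {
        fclose (fp);
        return NULL;
      }
    while ((c = fgetc (fp)) != EOF)
      {
        if (used + 1 >= size)
          {
            size *= 2;
            tmp = realloc (buf, size);
            if (!tmp)
              {
                free (buf);
                fclose (fp);
                return NULL;
              }
            buf = tmp;
          }
        buf[used++] = (char) c;
      }
    fclose (fp);
    buf[used] = '\0';
    if (len)
      *len = used;
    return buf;
  }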

The local HTML file parsing stuff should probably be left well
alone but possibly add some #ifdef code for Windows to open the
file in binary mode, though there may be differences between
compilers for that.




RE: Bug report: 1) Small error 2) Improvement to Manual

2002-01-17 Thread csaba.raduly


On 17/01/2002 07:34:05 Herold Heiko wrote:
[proper order restored]
 -Original Message-
 From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, January 17, 2002 2:15 AM
 To: Michael Jennings
 Cc: [EMAIL PROTECTED]
 Subject: Re: Bug report: 1) Small error 2) Improvement to Manual


 Michael Jennings [EMAIL PROTECTED] writes:

  1) There is a very small bug in WGet version 1.8.1. The bug occurs
 when a .wgetrc file is edited using an MS-DOS text editor:
 
  WGet returns an error message when the .wgetrc file is terminated
  with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the
  command-line language for all versions of Windows, so ignoring the
  end-of-file mark would make sense.

 Ouch, I never thought of that.  Wget opens files in binary mode and
 handles the line termination manually -- but I never thought to handle
 ^Z.

 As much as I'd like to be helpful, I must admit I'm loath to encumber
 the code with support for this particular thing.  I have never seen it
 before; is it only an artifact of DOS editors, or is it used on
 Windows too?



[snip copy con file.txt]

However in this case (at least when I just tried) the file won't contain
the ^Z. OTOH some DOS programs will still work on NT4, NT2k and XP, and
could be used, and would create files ending with ^Z. But do they really
belong here, and should wget be bothered?

What we really need to know is:

Is ^Z still a valid, recognized character indicating end-of-file (for
text-mode files) for command shell programs on Windows NT 4/2k/XP?
Could somebody with access to the *windows standards* shed more light on
this question?

My personal idea is:
As a matter of fact no *windows* text editor I know of, even the
supplied windows ones (notepad, wordpad) AFAIK will add the ^Z at the
end of file.txt. Wget is a *windows* program (although running in
console mode), not a *Dos* program (except for the real dos port I know
exists but never tried out).


I don't think there's a distinction between DOS and Windows programs
in this regard. The C runtime library is most likely to play a
significant role here. For a file fopen-ed in "rt" mode, the RTL
would convert \r\n -> \n and silently eat the _first_ ^Z,
returning EOF at that point.

When writing, it goes the other way 'round WRT \n -> \r\n.
I'm unsure about whether it writes ^Z at the end, though.

So personally I'd say it would not be really necessary adding support
for the ^Z, even in the win32 port; except possibly for the Dos port, if
the porter of that beast thinks it would be useful.


The problem could be solved by opening .netrc in "rt".
However, the "t" is a non-standard extension.

However, this is not wget's problem IMO. Different editors may behave
differently. Example: on OS/2 (which isn't a DOS shell, but can run
DOS programs), the system editor (e.exe) *does* append a ^Z at the end
of every file it saves. People have patched the binary to remove this
feature :-) AFAIK no other OS/2 editor does this.


--
Csaba Ráduly, Software Engineer   Sophos Anti-Virus
email: [EMAIL PROTECTED]http://www.sophos.com
US Support: +1 888 SOPHOS 9 UK Support: +44 1235 559933




Re: Bug report: 1) Small error 2) Improvement to Manual

2002-01-17 Thread Hrvoje Niksic

Herold Heiko [EMAIL PROTECTED] writes:

 My personal idea is:
 As a matter of fact no *windows* text editor I know of, even the
 supplied windows ones (notepad, wordpad) AFAIK will add the ^Z at the
 end of file.txt. Wget is a *windows* program (although running in
 console mode), not a *Dos* program (except for the real dos port I know
 exists but never tried out).
 
 So personally I'd say it would not be really necessary adding support
 for the ^Z, even in the win32 port;

That was my line of thinking too.



Re: Bug report: 1) Small error 2) Improvement to Manual

2002-01-17 Thread Michael Jennings

-


Obviously, this is completely your decision. You are right, only DOS editors make the 
mistake. (It should be noted that DOS is MS Windows' only command-line language. It 
isn't going away; even Microsoft supplies command-line utilities with all versions of 
its OSs. Yes, Windows will probably eventually go away, but not soon.)

However, I have a comment: There is simple logic that would solve this problem. WGet, 
when it reads a line in the configuration file, probably now strips off trailing 
spaces (hex 20, decimal 32). I suggest that it strip off both trailing spaces and 
control characters (characters with hex values of 1F or less, decimal values of 31 or 
less). This is a simple change that would work in all cases.
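
In case a concrete sketch helps (this is only an illustration of the 
rule above, not a patch against the actual .wgetrc parser):

  #include <string.h>

  /* Strip trailing spaces and control characters (codes of decimal 31
     or below, which covers CR, LF, TAB and ^Z) from a line that was
     read out of .wgetrc. */
  static void
  chop_trailing_junk (char *line)
  {
    size_t len = strlen (line);
    while (len > 0
           && ((unsigned char) line[len - 1] <= 31
               || line[len - 1] == ' '))
      line[--len] = '\0';
  }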

Regards,

Michael


__


Hrvoje Niksic wrote:

 Herold Heiko [EMAIL PROTECTED] writes:

  My personal idea is:
  As a matter of fact no *windows* text editor I know of, even the
  supplied windows ones (notepad, wordpad) AFAIK will add the ^Z at the
  end of file.txt. Wget is a *windows* program (although running in
  console mode), not a *Dos* program (except for the real dos port I know
  exists but never tried out).
 
  So personally I'd say it would not be really necessary adding support
  for the ^Z, even in the win32 port;

 That was my line of thinking too.




RE: Bug report: 1) Small error 2) Improvement to Manual

2002-01-17 Thread Herold Heiko

 From: Michael Jennings [mailto:[EMAIL PROTECTED]]
 Obviously, this is completely your decision. You are right, 
 only DOS editors make the mistake. (It should be noted that 
 DOS is MS Windows only command line language. It isn't going 
 away; even Microsoft supplies command line utilities with all 
 versions of its OSs. Yes, Windows will probably eventually go 

Please note the difference: all windows versions include a command line.
However that command line afaik is not dos - it is able to run dos
programs, either because it is based on dos (win 9x) or because it is
capable of understanding the difference between w32 command-line programs
and dos programs, and starting the necessary dos *emulation*. But it is
not dos, and the behaviour is not like dos.
As far as I know, windows command-line programs do not use ^Z as an
end-of-file terminator (although some do honour it for
emulation/compatibility); only real dos programs do (does anybody know if
there is a - MS - standard for this?). If this is true, should wget on
windows really emulate the behaviour of dos programs, of an environment
windows originally was based on but where it is *not*running*anymore*
(wget I mean)? From a purist's point of view, no. From an end-user point
of view, possibly, in order to facilitate the changeover.
On the other hand, your report is the first one I ever saw; considering
Hrvoje's reaction and the lack of support in the original windows port,
I'd say this is not a problem generally felt as important, so personally
I'm in favor of not cluttering up the port any more with special
behaviour. But it is Hrvoje's decision, as always.
If you feel it is important, write a patch and submit it; it shouldn't be
a major piece of work.
 
Heiko

-- 
-- PREVINET S.p.A.[EMAIL PROTECTED]
-- Via Ferretto, 1ph  x39-041-5907073
-- I-31021 Mogliano V.to (TV) fax x39-041-5907087
-- ITALY



Re: Bug report

2001-12-13 Thread Hrvoje Niksic

Pavel Stepchenko [EMAIL PROTECTED] writes:

 Hello bug-wget,
 
 $ wget --version
 GNU Wget 1.8
 
 $ wget 
ftp://password:[EMAIL PROTECTED]:12345/Dir%20One/This.Is.Long.Name.Of.The.Directory/*
 Warning: wildcards not supported in HTTP.
 
 Oooops! But this is an FTP URL, not HTTP!

Are you using a proxy?
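
(If a proxy is configured - via the http_proxy/ftp_proxy variables or in 
wgetrc - wget fetches ftp:// URLs through the HTTP proxy, where globbing 
isn't available, hence the warning.  One quick way to test, with the 
proxy variables unset for just this one command, would be something like

  env -u ftp_proxy -u http_proxy wget "ftp://.../Dir%20One/.../*"

where the dots stand for the same host, port and directories as in the 
command above.)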