RE: add tar option

2002-04-23 Thread Herold Heiko

I think wget sometimes (often) needs to reread what it wrote to disk
(HTML conversion). This means something like that wouldn't work, or rather,
would be too specialized.

What would work better is a switch (sometimes requested in the past) to
write to a file a list of everything retrieved (or better, everything saved
to disk); then you could use that list, for example as input to cpio or
whatever you prefer.
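
For illustration only (the `--log-saved-files' switch below is hypothetical;
no such option exists in wget today), the idea would be something like:

$ wget -r --log-saved-files=list.txt http://www.sgi.com/
$ cpio -o < list.txt > site.cpio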

Heiko

-- 
-- PREVINET S.p.A.            [EMAIL PROTECTED]
-- Via Ferretto, 1            ph  +39-041-5907073
-- I-31021 Mogliano V.to (TV) fax +39-041-5907472
-- ITALY

 -Original Message-
 From: Max Waterman [mailto:[EMAIL PROTECTED]]
 Sent: Monday, April 22, 2002 10:54 PM
 To: [EMAIL PROTECTED]
 Subject: RFE:add tar option
 
 
 Hi,
 
 I recently had need to pipe what wget retrieved through a command
 before writing to disk. There was no way I could do this with the
 version I had.
 
 What I would like wget to do is to create a tar stream of the files
 and directories it is downloading and send that to stdout, kind of like:
 
 tar -cvvf - files...
 
 then I could pipe that into whatever I wanted, for example:
 
 $ wget -r -l 3 --tar 'http://www.sgi.com/' | other commands | tar -xvvf -
 
 Anyone think this is a good idea?
 
 Please 'cc' me, since I am not on the email list.
 
 Thanks.
 
 Max.
 



Unsubscribe due to rebrand.

2002-04-23 Thread Ian . Pellew



Guys

Apologies for using this route.

My email address changed since I subscribed.

Please can the administrator unsubscribe/help me, because my email address
has changed (company rebrand).
I am present on your list as [EMAIL PROTECTED], not the above.

Unsubscribing, as I am going on vacation.

Happy wgetting to U all.

Ian




ScanMail Message: To Recipient virus found or matched file blocking setting.

2002-04-23 Thread System Attendant

ScanMail for Microsoft Exchange has taken action on the message, please
refer to the contents of this message for further details.

Sender = [EMAIL PROTECTED]
Recipient(s) = [EMAIL PROTECTED];
Subject = To your DTD.
Scanning Time = 04/23/2002 10:45:48
Engine/Pattern = 6.150-1001/269

Action on message:
The attachment Qa.bat matched file blocking settings. ScanMail has taken the
Deleted action. 

An attachment has been blocked which is classified as dangerous or a Virus
has been found in the mail received by you. The sender of this mail was
automatically informed. Among the attachments classified as dangerous are
all executable files like *.exe, *.bat, *.com, *.cmd, *.pif, *.scr. If you
need to send or receive such an attachment you should compress it first into
a *.zip archive by using Winzip.



Re: segmentation fault on bad url

2002-04-23 Thread Ian Abbott

On 22 Apr 2002 at 21:38, Renaud Saliou wrote:

 Hi,
 
   wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after 
 -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com 
 
 DEBUG output created by Wget 1.8.1 on linux-gnu.
 
 zsh: segmentation fault  wget -t 3 -d -r -l 3 -H --random-wait -nd 
 --delete-after

It looks like this has been fixed in the current CVS version
(actually a few days old):

$ wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after \
-A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com
DEBUG output created by Wget 1.8.1+cvs on linux-gnu.

http://http://www.microsoft.com: Bad port number.

FINISHED --10:36:45--
Downloaded: 0 bytes in 0 files




Re: apache irritations

2002-04-23 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002, Tony Lewis wrote:

   I'm not sure what you are referring to.  We are discussing a common
  problem with static pages generated by default by Apache as index.html
  objects for server's filesystem directories providing no default page.
 
 Really? The original posting from Jamie Zawinski said:
 
  I know this would be somewhat evil, but can we have a special case in
  wget to assume that files named ?N=D and index.html?N=D are the same
  as index.html?  I'm tired of those dumb apache sorting directives
  showing up in my mirrors as if they were real files...
 
 I understood the question to be about URLs containing query strings (which
 Jamie called sorting directives) showing up as separate files. I thought the
 discussion was related to that topic. Maybe it diverged from that later in
 the chain and I missed the change of topic.

 These sorting directives are specific to Apache, which generates a
replacement index.html for file system directories that contain no default
page (assuming neither such generation nor the directives are disabled).
They always have the form ?capital=capital appended to the base URL of a
directory.  See e.g. http://www.kernel.org/pub/linux/ and its
subdirectories for an example.
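
 For readers without the earlier messages, one untested sketch of such a
reject pattern (here `?' acts as the single-character wildcard, which also
matches the literal question mark in the saved file name) might be:

$ wget -r -R '*?N=A,*?N=D,*?M=A,*?M=D,*?S=A,*?S=D,*?D=A,*?D=D' \
    http://www.kernel.org/pub/linux/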

 I think what Jamie wants is one copy of index.html no matter how many links
 of the form index.html?N=D appear.

 So do I, and my shell pattern will work as expected.

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--+
+e-mail: [EMAIL PROTECTED], PGP key available+




wget does not honour content-length http header [http://bugs.debian.org/143736]

2002-04-23 Thread Noel Koethe

Hello,

If the http content-length header differs from actual data length,
wget disregards the http specification as follows:
1) if content-length is greater than actual data, wget keeps retrying to
receive the whole file indefinitely. Using the command-line parameter
--ignore-length fixes this but should it not be on by default?
2) If content-length is smaller than the actual data sent by the server, wget
happily downloads it all instead of stopping at whatever content-length
specifies. This is contrary to the spec, which strictly states that
content-length must be obeyed and that the user must be notified that
something strange happened. It correctly tells the user that it received
nnn/mmm bytes, where mmm is content-length but should there not be an
error message, too?

http://bugs.debian.org/143736

Thank you.

-- 
Noèl Köthe



Re: RFE:add tar option

2002-04-23 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002, Max Waterman wrote:

 Someone (rudely) suggested it was unacceptable to ask for a 'cc' rather 
 than joining the email list. If this is so, I apologise, but would like 
 to point out that I was only following the suggestion on the wget web page :

 I believe such suggestions (whether rude or polite) are unacceptable
themselves.  For various reasons, subscribing to every mailing list out
there just to report problems, provide suggestions, etc. without doing
regular work may cause trouble for people.  E.g. I am involved in several
projects that require tracking or participating in discussions on mailing
lists.  I am subscribed to about 30 mailing lists now, and to be able to
cope with the considerable amount of mail I receive I must seriously weigh
every additional subscription.

 If cc-ing was unacceptable here, the list could equally well be closed to
non-subscribers.

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--+
+e-mail: [EMAIL PROTECTED], PGP key available+




Re: wget does not honour content-length http header[http://bugs.debian.org/143736]

2002-04-23 Thread Hrvoje Niksic

Noel Koethe [EMAIL PROTECTED] writes:

 If the http content-length header differs from actual data length,
 wget disregards the http specification as follows:

It doesn't disregard the HTTP specification.  As far as I'm aware,
HTTP simply specifies that the information provided by Content-Length
must be correct.  When it is not correct, the protocol has been broken
by the server and the best Wget can do is try to make sense of the
situation.  In both cases you report, Wget's behavior is by design.

 1) if content-length is greater than actual data, wget keeps
 retrying to receive the whole file indefinitely.

Not indefinitely, but until `--tries' attempts (20 by default) have
been exhausted.

 Using the command-line parameter --ignore-length fixes this but
 should it not be on by default?

No.  When you're downloading files over a slow or unstable network,
you will often get EOF while reading data.  Retrying in spite of that
EOF has been one of Wget's primary features since the very beginning.

So Wget is not disregarding the spec, it is *honoring* it by assuming
that the provided Content-Length is correct, as it should be.  This
feature has made many a download possible.  In the cases where the
content-length header truly is broken, use `--ignore-length'.
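
For a server that really does send a bogus Content-Length, the workaround
mentioned above is simply (the URL is a placeholder):

$ wget --ignore-length http://example.com/file-with-bogus-length.iso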

 2) If content-length is smaller than actual data sent by server,
 wget happily downloads it all instead of stopping at what ever
 content-length specified.

Again, this is a feature.  Broken CGI scripts often report broken
values for `Content-Length'.  When more data arrives, it becomes
apparent that the reported value is *broken* (unlike in the case when
less data arrives).  Wget can either dismiss the rest of the data or
dismiss the header.  I judged the data actually transmitted over the
wire to be more important than one obviously broken header.

The exception is when persistent connections are used.  In that case,
Content-Length is honored to the letter, and the remote server had
*better* provide the correct value, or else.

 This is contrary to the spec which strictly states that
 content-length must be obeyed and that the user must be notified
 that something strange happened.

Which spec says that?



Re: add tar option

2002-04-23 Thread Hrvoje Niksic

Herold Heiko [EMAIL PROTECTED] writes:

 I think wget sometimes (often) needs to reread what it wrote to the
 disk (HTML conversion). This means something like that wouldn't
 work, or rather, would be too specialized.

In the long run, I hope to fix that.  The first step has already been
done -- Wget is traversing the links breadth-first, which means that
it only needs to read the HTML file once.

The next step would be to allow Wget's reader to read directly into
memory, or to read both into memory and print to stdout.  This way,
things like `wget --spider -r URL' or `wget -O foo -r URL' would work
perfectly.  Alternately, Wget could write into a temporary file, read
the HTML, and discard the file.

I don't see much use for adding the `--tar' functionality to Wget
because Wget should preferably do one thing (download stuff off the
web), and do it well -- post-processing of the output, such as
serializing it into a stream, should be done by a separate utility --
in this case, `tar'.

On technical grounds, it might be hard to shoehorn Wget's mode of
operation into what `tar' expects.  For example, Wget might need to
revisit directories in random order.  I'm not sure if a tar stream is
allowed to do that.

vision
However, it might be cool to create a simple output format for
serializing the result of a Wget run.  Then a converter could be
provided that converts this to tar, cpio, whatever.  The downside to
this is that we would invent Yet Another Format, but the upside would
be that Wget proper would not depend on external libraries to support
`tar' and whatnot.
/vision
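
To make the vision concrete (every option and tool name below is
hypothetical; none of this exists yet), usage might eventually look like:

$ wget -r --serialize=- http://www.sgi.com/ | wget2tar > site.tar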



Re: add tar option

2002-04-23 Thread Ian Abbott

On 23 Apr 2002 at 18:19, Hrvoje Niksic wrote:

 On technical grounds, it might be hard to shoehorn Wget's mode of
 operation into what `tar' expects.  For example, Wget might need to
 revisit directories in random order.  I'm not sure if a tar stream is
 allowed to do that.

You can add stuff to a tar stream in a pretty much random order -
that's effectively what you get when you use tar's -r option to
append to the end of an existing archive.  (I used to use that with
tapes quite often, once upon a time.)
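
For example (plain tar, no wget involved):

$ tar -cf archive.tar dir-a
$ tar -rf archive.tar dir-b     # append more entries later, in any order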



Re: Name changing

2002-04-23 Thread Hrvoje Niksic

Caddell, Travis [EMAIL PROTECTED] writes:

 I'm stuck with Windows at my office :(
 But what option offered by wget would allow the user to specify the name of
 the folder that the web site would be saved in?
 For example, if I were to run `wget -cdr www.cnn.com', the folder would be
 named www.cnn.com, but if I wanted to specify the name of the folder to be
 saved as MyNews - is there a command for that?

Sure.  Use `-P MyNews' -- then it will be saved to
MyNews/www.cnn.com/...  You can further disable the host name
component in the file name with `-nH', in which case it will be just
`MyNews/...'.
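
For instance:

$ wget -r -nH -P MyNews http://www.cnn.com/

saves everything under MyNews/... with no www.cnn.com directory component.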

 Another question along the same lines - if I wanted to use an input
 file (wget -i C:\MyNewsLinks.txt) and then save each one in its own
 folder, Cnn (instead of www.cnn.com), WashPost (instead of
 www.washingtonpost.com) - is there a command for that too?

I don't think that's possible.
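
A possible workaround outside of wget proper (just a sketch; it assumes the
input file is reorganized into folder-name/URL pairs, which is not the plain
`-i' format):

$ while read folder url; do wget -r -nH -P "$folder" "$url"; done < MyNewsLinks.txt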



Feature request

2002-04-23 Thread Frederic Lochon \(crazyfred\)

Hello,

I'd like to know if there is a simple way to 'mirror' only the images
from a gallery (i.e. without thumbnails).
Maybe a new feature could be useful. This could be done in these ways:
- mirroring only images that are a link
- mirroring only 'last' links from a tree
- a more general option that would allow executing a script/command to
decide whether a file has to be saved (using the filename, or perhaps
the content as another option). This could help save some disk space when
mirroring many sites or big sites. Maybe this option could also be used to
rename files 'on-the-fly'.

Some galleries can't be mirrored in these ways, but many can.
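
One workaround that sometimes works today (just a sketch; it assumes the
thumbnails follow a naming convention such as a thumb_ prefix, which varies
from gallery to gallery):

$ wget -r -np -A.jpg,.jpeg,.png http://example.com/gallery/
$ find example.com/gallery -name 'thumb_*' -exec rm {} \;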


It also seems these options are incompatible:
--continue with --recursive
Supporting that combination could be useful, imho.

PS: sorry if these questions are recurrent; I don't have time to read
every post (on the news server at sunsite.dk) and I haven't seen a FAQ.

Thanks,

Frederic Lochon