RE: add tar option
I think wget sometimes (often) needs to reread what it wrote to disk (HTML conversion). This means something like that wouldn't work, or rather, would be too specialized. What would work better is a switch (requested a few times in the past) to write out to a file a list of everything retrieved (or better, everything saved to disk); you could then use that list as input to cpio or whatever you prefer.

Heiko

--
-- PREVINET S.p.A.              [EMAIL PROTECTED]
-- Via Ferretto, 1              ph  x39-041-5907073
-- I-31021 Mogliano V.to (TV)   fax x39-041-5907472
-- ITALY

-----Original Message-----
From: Max Waterman [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 22, 2002 10:54 PM
To: [EMAIL PROTECTED]
Subject: RFE: add tar option

Hi,

I recently had a need to pipe what wget retrieved through a command before writing it to disk. There was no way I could do this with the version I had.

What I would like wget to do is create a tar stream of the files and directories it is downloading and send that to stdout, kind of like:

    tar -cvvf - files...

Then I could pipe that into whatever I wanted, for example:

    $ wget -r -l 3 --tar 'http://www.sgi.com/' | other commands | tar -xvvf -

Does anyone think this is a good idea? Please 'cc' me, since I am not on the email list.

Thanks.

Max.
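Heiko's idea can be approximated today with existing tools, by mirroring first and then feeding the list of saved files to cpio. A minimal sketch, assuming the mirror lands in a www.sgi.com/ directory as in Max's example (wget itself has no --tar or list-output option):

    # mirror first, then archive everything that was saved to disk
    wget -r -l 3 'http://www.sgi.com/'
    # find produces the list of saved files; cpio -o packs them into an archive
    find www.sgi.com -type f | cpio -o > mirror.cpio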
Unsubscribe due to rebrand.
Guys,

Apologies for using this route. My email address changed since I subscribed (company rebrand). Please can the administrator unsubscribe/help me? I am present on your list as [EMAIL PROTECTED], not the above. Unsubscribing as I am going on vacation.

Happy wgetting to you all.

Ian
ScanMail Message: To Recipient virus found or matched file blocking setting.
ScanMail for Microsoft Exchange has taken action on the message; please refer to the contents of this message for further details.

Sender = [EMAIL PROTECTED]
Recipient(s) = [EMAIL PROTECTED];
Subject = To your DTD.
Scanning Time = 04/23/2002 10:45:48
Engine/Pattern = 6.150-1001/269

Action on message: The attachment Qa.bat matched file blocking settings. ScanMail has taken the Deleted action.

An attachment has been blocked which is classified as dangerous, or a virus has been found in the mail received by you. The sender of this mail was automatically informed. Among the attachments classified as dangerous are all executable files like *.exe, *.bat, *.com, *.cmd, *.pif, *.scr. If you need to send or receive such an attachment, you should compress it first into a *.zip archive by using WinZip.
Re: segmentation fault on bad url
On 22 Apr 2002 at 21:38, Renaud Saliou wrote:

> Hi,
>
>   wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after \
>     -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com
>
>   DEBUG output created by Wget 1.8.1 on linux-gnu.
>   zsh: segmentation fault  wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after

It looks like this has been fixed in the current CVS version (actually a few days old):

    $ wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after \
        -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com
    DEBUG output created by Wget 1.8.1+cvs on linux-gnu.
    http://http://www.microsoft.com: Bad port number.

    FINISHED --10:36:45--
    Downloaded: 0 bytes in 0 files
Re: apache irritations
On Mon, 22 Apr 2002, Tony Lewis wrote:

> I'm not sure what you are referring to.

We are discussing a common problem with static pages generated by default by Apache as index.html objects for the server's filesystem directories that provide no default page.

> Really? The original posting from Jamie Zawinski said:
>
>     I know this would be somewhat evil, but can we have a special case in
>     wget to assume that files named ?N=D and index.html?N=D are the same
>     as index.html? I'm tired of those dumb apache sorting directives
>     showing up in my mirrors as if they were real files...
>
> I understood the question to be about URLs containing query strings (which
> Jamie called sorting directives) showing up as separate files. I thought
> the discussion was related to that topic. Maybe it diverged from that
> later in the chain and I missed the change of topic.

These sorting directives are specific to Apache when it builds a replacement index.html file for server filesystem directories containing no default page (assuming neither such building nor the directives are disabled). They always have the form of ?capital=capital appended to the base URL of a directory. See e.g. http://www.kernel.org/pub/linux/ and its subdirectories to see what it looks like.

> I think what Jamie wants is one copy of index.html no matter how many
> links of the form index.html?N=D appear.

So do I, and my shell pattern will work as expected.

--
+  Maciej W. Rozycki, Technical University of Gdansk, Poland  +
+------------------------------------------------------------+
+       e-mail: [EMAIL PROTECTED], PGP key available          +
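For readers hitting the same problem, one possible post-mirror cleanup is to delete the sorting-directive copies and keep only the plain index.html. This is only an illustration; it is not the shell pattern Maciej refers to, which is not quoted here, and the directory name is just the kernel.org example from above:

    # remove Apache sorting-directive duplicates such as "index.html?N=D"
    # ('\?' keeps the question mark literal in find's -name pattern)
    find www.kernel.org -type f -name 'index.html\?*=*' -exec rm {} +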
wget does not honour content-length http header [http://bugs.debian.org/143736]
Hello,

If the HTTP Content-Length header differs from the actual data length, wget disregards the HTTP specification as follows:

1) If Content-Length is greater than the actual data, wget keeps retrying to receive the whole file indefinitely. Using the command-line parameter --ignore-length fixes this, but should it not be on by default?

2) If Content-Length is smaller than the actual data sent by the server, wget happily downloads it all instead of stopping at whatever Content-Length specified. This is contrary to the spec, which strictly states that Content-Length must be obeyed and that the user must be notified that something strange happened. wget correctly tells the user that it received nnn/mmm bytes, where mmm is the Content-Length, but should there not be an error message, too?

http://bugs.debian.org/143736

Thank you.

--
Noèl Köthe
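A minimal way to reproduce case 1 locally is a one-shot server that advertises more data than it sends. This is only a sketch, assuming a netcat that accepts the -l -p listening flags (spellings vary between netcat implementations) and a free port 8080:

    # advertise 1000 bytes but send only 10; wget hits EOF early and retries
    printf 'HTTP/1.0 200 OK\r\nContent-Length: 1000\r\n\r\nshort body' | nc -l -p 8080 &
    wget http://localhost:8080/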
Re: RFE:add tar option
On Mon, 22 Apr 2002, Max Waterman wrote:

> Someone (rudely) suggested it was unacceptable to ask for a 'cc' rather
> than joining the email list. If this is so, I apologise, but would like
> to point out that I was only following the suggestion on the wget web
> page:

I believe such suggestions (whether rude or polite) are unacceptable themselves. For various reasons, subscribing to every mailing list out there just to report problems, provide suggestions, etc. without doing regular work there may cause trouble for people. E.g. I am involved in several projects that require tracking or participating in discussions on mailing lists. I am subscribed to about 30 mailing lists now, and to be able to cope with the considerable amount of mail I receive I must seriously weigh every additional subscription.

If cc-ing were unacceptable here, the list could equally well be closed to non-subscribers.

--
+  Maciej W. Rozycki, Technical University of Gdansk, Poland  +
+------------------------------------------------------------+
+       e-mail: [EMAIL PROTECTED], PGP key available          +
Re: wget does not honour content-length http header [http://bugs.debian.org/143736]
Noel Koethe [EMAIL PROTECTED] writes:

> If the http content-length header differs from actual data length,
> wget disregards the http specification as follows:

It doesn't disregard the HTTP specification. As far as I'm aware, HTTP simply specifies that the information provided by Content-Length must be correct. When it is not correct, the protocol has been broken by the server and the best Wget can do is try to make sense of the situation. In both cases you report, Wget's behavior is by design.

> 1) if content-length is greater than actual data, wget keeps retrying
> to receive the whole file indefinitely.

Not indefinitely, but until `--tries' attempts (20 by default) have been exhausted.

> Using the command-line parameter --ignore-length fixes this but should
> it not be on by default?

No. When you're downloading files over a slow or unstable network, you will often get EOF while reading data. Retrying in spite of that EOF has been one of Wget's primary features since the very beginning. So Wget is not disregarding the spec; it is *honoring* it by assuming that the provided Content-Length is correct, as it should be. This feature has made many a download possible. In the cases where the Content-Length header truly is broken, use `--ignore-length'.

> 2) If content-length is smaller than actual data sent by server, wget
> happily downloads it all instead of stopping at whatever
> content-length specified.

Again, this is a feature. Broken CGI scripts often report broken values for `Content-Length'. When more data arrives, it becomes apparent that the reported value is *broken* (unlike in the case when less data arrives). Wget can either dismiss the rest of the data or dismiss the header. I judged the data actually transmitted over the wire to be more important than one obviously broken header.

The exception is when persistent connections are used. In that case, Content-Length is honored to the letter, and the remote server had *better* provide the correct value, or else.

> This is contrary to the spec which strictly states that content-length
> must be obeyed and that the user must be notified that something
> strange happened.

Which spec says that?
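A quick illustration of the two options discussed above (the URL is only a placeholder):

    # trust the data stream and ignore a bogus Content-Length header
    wget --ignore-length http://example.com/file
    # or just cap the retry count instead of the default of 20
    wget --tries=3 http://example.com/file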
Re: add tar option
Herold Heiko [EMAIL PROTECTED] writes:

> I think wget sometimes (often) needs to reread what it wrote to disk
> (HTML conversion). This means something like that wouldn't work, or
> rather, would be too specialized.

In the long run, I hope to fix that. The first step has already been done -- Wget is traversing the links breadth-first, which means that it only needs to read each HTML file once. The next step would be to allow Wget's reader to read directly into memory, or to read both into memory and print to stdout. This way, things like `wget --spider -r URL' or `wget -O foo -r URL' would work perfectly. Alternately, Wget could write into a temporary file, read the HTML, and discard the file.

I don't see much use for adding the `--tar' functionality to Wget because Wget should preferably do one thing (download stuff off the web) and do it well -- post-processing of the output, such as serializing it into a stream, should be done by a separate utility, in this case `tar'.

On technical grounds, it might be hard to shoehorn Wget's mode of operation into what `tar' expects. For example, Wget might need to revisit directories in random order. I'm not sure if a tar stream is allowed to do that.

<vision>
However, it might be cool to create a simple output format for serializing the result of a Wget run. Then a converter could be provided that converts this to tar, cpio, whatever. The downside to this is that we would invent Yet Another Format, but the upside would be that Wget proper would not depend on external libraries to support `tar' and whatnot.
</vision>
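As a practical aside, the separate-utility approach described above already works today by archiving after the mirror finishes; a minimal sketch using Max's original example URL:

    # mirror first, then serialize the saved tree to stdout the way the
    # requested --tar option would have, as a separate post-processing step
    wget -r -l 3 'http://www.sgi.com/' && tar -cvf - www.sgi.com | gzip > sgi-mirror.tar.gz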
Re: add tar option
On 23 Apr 2002 at 18:19, Hrvoje Niksic wrote:

> On technical grounds, it might be hard to shoehorn Wget's mode of
> operation into what `tar' expects. For example, Wget might need to
> revisit directories in random order. I'm not sure if a tar stream is
> allowed to do that.

You can add stuff to a tar stream in a pretty much random order -- that's effectively what you get when you use tar's -r option to append to the end of an existing archive. (I used to use that with tapes quite often, once upon a time.)
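A quick illustration of the append behaviour being described; the directory names are placeholders, and -r only works on plain, uncompressed archives:

    tar -cf archive.tar dir1/     # create the archive from the first directory
    tar -rf archive.tar dir2/     # -r appends more members later, in any order
    tar -tf archive.tar           # list the combined contents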
Re: Name changing
Caddell, Travis [EMAIL PROTECTED] writes:

> I'm stuck with Windows at my office :( But what option offered by wget
> would allow the user to specify the name of the folder that the web
> site would be saved in? For example, if I were to `wget -cdr
> www.cnn.com' the folder would be named www.cnn.com, BUT if I wanted to
> specify the name of the folder to be saved as MyNews - is there a
> command for that?

Sure. Use `-P MyNews' -- then it will be saved to MyNews/www.cnn.com/... You can further disable the host name component in the file name with `-nH', in which case it will be just `MyNews/...'.

> Another question along the same lines - if I wanted to use an input
> file, `wget -i C:\MyNewsLinks.txt', and then save each one in its own
> folder, Cnn (instead of www.cnn.com), WashPost (instead of
> www.washingtonpost.com) - is there a command for that too?

I don't think that's possible.
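Putting the two options from the answer together (the directory name is just the example from the question):

    # save the recursive mirror under MyNews/ instead of www.cnn.com/,
    # dropping the host-name directory component entirely with -nH
    wget -r -nH -P MyNews http://www.cnn.com/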
Feature request
Hello,

I'd like to know if there is a simple way to 'mirror' only the images from a gallery (i.e. without thumbnails). Maybe a new feature could be useful. This could be done in the following ways:

- mirroring only images that are a link
- mirroring only 'last' links from a tree
- a more general option that would allow executing a script/command to decide whether the file has to be saved (using the filename, or the content - maybe as another option)

This could help save some disk space when mirroring many sites or big sites. Maybe this option could also be used to adapt (rename) filenames on the fly. Some galleries can't be mirrored this way, but many can.

It also seems these options are incompatible: --continue with --recursive. Supporting that combination could be useful, imho.

PS: sorry if these questions are recurrent, I don't have time to read every post (on the news server at sunsite.dk) and I haven't seen any FAQ.

Thanks,
Frederic Lochon
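No such feature exists in wget itself, but a rough approximation with existing options is to recurse and keep only image suffixes. This is only a sketch (URL and depth are placeholders), and it cannot tell linked full-size images apart from inline thumbnails, which is exactly what is being asked for:

    # keep only image files while recursing; thumbnails still match, though
    wget -r -l 2 -A jpg,jpeg,png,gif http://example.com/gallery/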