Re: [Bug-wget] HTTP quota bug
Andrey Semenchuk wrote:
> Hi!
>
> As described in the documentation, when the --quota option is used,
> "download will be aborted when the quota is exceeded". But the HTTP code
> has no corresponding lines to break the download, unlike the FTP code.
> So, if a file is downloaded via HTTP, it will be fully downloaded and
> stored (whether or not --quota is used), with only an additional warning
> when the quota is exceeded: "Download quota (... bytes) EXCEEDED!"

What documentation are you talking about? This is what I see:

`-Q QUOTA'
`--quota=QUOTA'
     Specify download quota for automatic retrievals. The value can be
     specified in bytes (default), kilobytes (with `k' suffix), or
     megabytes (with `m' suffix).

     Note that quota will never affect downloading a single file. So if
     you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz', all of
     the `ls-lR.gz' will be downloaded. The same goes even when several
     URLs are specified on the command-line. However, quota is respected
     when retrieving either recursively, or from an input file. Thus you
     may safely type `wget -Q2m -i sites'--download will be aborted when
     the quota is exceeded.

     Setting quota to 0 or to `inf' unlimits the download quota.

Which is exactly the case, whether you're talking FTP or HTTP. It doesn't
break the download in the middle of a file. Which, yeah, I agree is
counter-intuitive. But with a program like wget, I can never be sure that
changing this won't break someone's script somewhere. Not that we
shouldn't do proper quotas, but we most likely need to add that feature
as a different option.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
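The documented behavior is that the quota is consulted only *between* retrievals, never mid-file: the file that crosses the limit is still downloaded in full, and only subsequent files in a recursion or `-i` list are skipped. A minimal sketch of that accounting (hypothetical function names, not wget's actual code):

```python
# Sketch of wget's documented quota semantics: the quota is checked
# only *between* retrievals, so the file that crosses the limit is
# still downloaded whole; only later files in the list are skipped.

def retrieve_all(file_sizes, quota):
    """Return the sizes actually downloaded given a byte quota.

    file_sizes: sizes of the files an -i list or recursion would fetch.
    quota: byte limit, or None for unlimited (-Q0 / -Qinf).
    """
    downloaded = []
    total = 0
    for size in file_sizes:
        if quota is not None and total >= quota:
            break  # "Download quota (... bytes) EXCEEDED!"
        downloaded.append(size)  # a single file is never truncated
        total += size
    return downloaded

# The second file pushes the total past the 10 kB quota but is still
# fetched whole; only the third file is skipped.
print(retrieve_all([8_000, 8_000, 8_000], 10_000))
```

Note that a single file larger than the quota is still fetched completely, matching "quota will never affect downloading a single file".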
[Bug-wget] HTTP quota bug
Hi!

As described in the documentation, when the --quota option is used,
"download will be aborted when the quota is exceeded". But the HTTP code
has no corresponding lines to break the download, unlike the FTP code.
So, if a file is downloaded via HTTP, it will be fully downloaded and
stored (whether or not --quota is used), with only an additional warning
when the quota is exceeded: "Download quota (... bytes) EXCEEDED!"

--
Best wishes,
Andrey Semenchuk
Trifle Co., Ltd.
Re: [Bug-wget] download page-requisites with spanning hosts
On Thu, Apr 30, 2009 at 03:31:21AM -0500, Jake b wrote:
> On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar wrote:
> > On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
>
> but i'm not sure how to tell wget that the output html file should be
> named.

wget -O OUTPUT_FILE_NAME

> > > How do I make wget download all images on the page? I don't want to
> > > recurse other hosts, or even sijun, just download this page, and
> > > all images needed to display it.
> >
> > That's not an easy task. Especially because all big desktop images
> > are stored on other servers. I think wget is not powerful enough to
> > do it all on its own.
>
> Are you saying because some services show a thumbnail, then click to do
> the full image?
[…]
> Would it be simpler to say something like: download page 912, recursion
> level=1 (or 2?), except for non-image links. (so it only allows
> recursion on images, ie: downloading "randomguyshost.com/3.png")
>
You can limit downloads according to file name extensions (option -A),
however this will remove the sole main HTML file and prevent recursion.
And no, there is no option to download only files pointed to from a
specific HTML element like IMG. Without the -A option, you get a lot of
useless files (regardless of spanning).

If you look at the locations of the files you are interested in, you
will see that all the files are located outside the Sijun domain, and
every page contains only a small number of such files. Thus it's more
efficient, and friendlier to the servers, to extract those URLs first
and then download only them.

> But the problem that it does not span any hosts? Is there a way I can
> achieve this, if I do the same, except, allow span everybody, recurse
> lvl=1, and only recurse non-images.
>
There is option -H for spanning. The following wget-only command does
what you want, but as I said it produces a lot of useless requests and
files:

wget -p -l 1 -H 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'

> > I propose using other tools to extract the image URLs and then to
> > download them using wget. E.g.:
>
> I guess I could use wget to get the html, and parse that for image tags
> manually, but, then I don't get the forum thread comments. Which isn't
> required, but would be nice.
>
You can do both: extract image URLs and extract comments.

> > wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -
>
> Ok, will have to try it out. (In windows ATM so I can't pipe.)
>
AFAIK the Windows shells command.com and cmd.exe support pipes.

> Using python, and I have dual boot if needed.
>
Or you can execute programs connected through pipes in Python.

-- Petr
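Since Jake mentions using Python and not having shell pipes handy, here is a rough Python translation of the `wget | grep | wget` pipeline from this thread — fetch the page, extract absolute image URLs by extension, download each. It is a sketch only; `extract_image_urls` and `download_all` are names invented here, and the regex mirrors the grep pattern (made non-capturing so `findall` returns whole URLs, not just extensions):

```python
# Rough Python equivalent of the shell pipeline in this thread:
# fetch the page, grep out absolute image URLs by file extension,
# then download each one. Works without shell pipes (e.g. on Windows).
import re
import urllib.request

# Same idea as grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)'; the group
# is non-capturing so re.findall returns full URLs.
IMG_RE = re.compile(r'http://[^"]*\.(?:jpg|jpeg|png)')

def extract_image_urls(html):
    """Return the absolute jpg/jpeg/png URLs found in an HTML string."""
    return IMG_RE.findall(html)

def download_all(urls):
    for url in urls:
        name = url.rsplit('/', 1)[-1]          # e.g. 2.png
        urllib.request.urlretrieve(url, name)  # like wget -i -

sample = '<img src="http://tinypic.com/2.png"> <a href="/local.gif">x</a>'
print(extract_image_urls(sample))
```

As Petr notes, some dead image hosts return a dummy HTML page instead of a proper error code, so a check on the response's Content-Type before saving would be a sensible addition.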
Re: [Bug-wget] download page-requisites with spanning hosts
On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar wrote:
> On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> > Instead of creating something like: "912.html" or "index.html" it
> > instead becomes: "viewtopic@t=29807&postdays=0&postorder=asc&start=27330"
>
> That's normal, because the server doesn't provide any useful
> alternative name via HTTP headers, which could be obtained using wget's
> option "--content-disposition".

I already know how to get the page number (my python script converts
27330 to 912 and back), but i'm not sure how to tell wget that the
output html file should be named.

> > How do I make wget download all images on the page? I don't want to
> > recurse other hosts, or even sijun, just download this page, and all
> > images needed to display it.
>
> That's not an easy task. Especially because all big desktop images are
> stored on other servers. I think wget is not powerful enough to do it
> all on its own.

Are you saying because some services show a thumbnail, then click to do
the full image? I'm not worried about that, since the majority are full
size in the thread.

Would it be simpler to say something like: download page 912, recursion
level=1 (or 2?), except for non-image links. (so it only allows
recursion on images, ie: downloading "randomguyshost.com/3.png")

But the problem is that it does not span any hosts. Is there a way I can
achieve this, if I do the same, except, allow span everybody, recurse
lvl=1, and only recurse non-images?

> I propose using other tools to extract the image URLs and then to
> download them using wget. E.g.:

I guess I could use wget to get the html, and parse that for image tags
manually, but, then I don't get the forum thread comments. Which isn't
required, but would be nice.

> wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -

Ok, will have to try it out. (In windows ATM so I can't pipe.)

> Actually, I suppose you use some unix environment, where you have a
> powerful collection of external tools (grep, seq) available and amazing
> shell scripting abilities (like colons and loops).
>
> -- Petr

Using python, and I have dual boot if needed.

-- Jake
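On the naming question: Jake's own numbers (start=27330 corresponds to page 912) imply 30 posts per page, so the page number can be computed and handed to wget via -O as the output file name. A sketch only — the 30-posts-per-page stride is an inference from those two numbers, not something wget knows about, and `fetch_page` is a name invented here:

```python
# Compute a readable output name from the forum URL's "start" parameter
# and pass it to wget via -O. Assumes 30 posts per page, inferred from
# the start=27330 <-> "page 912" pair mentioned in this thread.
import subprocess

POSTS_PER_PAGE = 30  # assumption inferred from 27330 <-> 912
BASE = ('http://forums.sijun.com/viewtopic.php'
        '?t=29807&postdays=0&postorder=asc&start=%d')

def start_to_page(start):
    return start // POSTS_PER_PAGE + 1

def page_to_start(page):
    return (page - 1) * POSTS_PER_PAGE

def fetch_page(page):
    """Save one thread page as e.g. 912.html instead of viewtopic@t=..."""
    subprocess.run(['wget', '-O', '%d.html' % page,
                    BASE % page_to_start(page)], check=True)

print(start_to_page(27330))  # -> 912
```

Note that -O names a single output file, so it is best combined with a plain fetch of the HTML rather than with -p, which writes everything it retrieves.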
Re: [Bug-wget] download page-requisites with spanning hosts
On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> The wget command I am using:
> wget.exe -p -k -w 15 "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330"
>
> It has 2 problems:
>
> 1) Rename file:
>
> Instead of creating something like: "912.html" or "index.html" it
> instead becomes: "viewtopic@t=29807&postdays=0&postorder=asc&start=27330"
>
That's normal, because the server doesn't provide any useful alternative
name via HTTP headers, which could be obtained using wget's option
"--content-disposition". If you want to get the number of the gallery
page, you need to parse the HTML code by hand to obtain it (e.g. using
grep). However, I guess a better naming convention is the value of the
"start" URL parameter (in your example, the number 27330).

> 2) images that span hosts are failing.
>
> I have page-requisites on, but, since some pages are on tinypic, or
> imageshack, etc it is not downloading them. Meaning it looks like this:
>
> sijun/page912.php
> imageshack.com/1.png
> tinypic.com/2.png
> randomguyshost.com/3.png
>
> Because of this, I cannot simply list all domains to span. I don't
> know all the domains, since people have personal servers.
>
> How do I make wget download all images on the page? I don't want to
> recurse other hosts, or even sijun, just download this page, and all
> images needed to display it.
>
That's not an easy task. Especially because all big desktop images are
stored on other servers. I think wget is not powerful enough to do it
all on its own. I propose using other tools to extract the image URLs
and then to download them using wget. E.g.:

wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -

This command downloads the HTML code, uses grep to find all image files
stored on other servers (deciding through file name extensions and
absolute addresses), and finally downloads those images.

There is one little problem: not all of the images still exist, and some
servers return a dummy page instead of a proper error code. So you can
sometimes get non-image files.

> [ This one is a lower priority, but someone might already know how to
> solve this ]
> 3) After this is done, I want to loop to download multiple pages. It
> would be cool if I downloaded pages 900 to 912, and each page's next
> link worked correctly to link to the local versions.
> […]
> Either way, I have a simple script that can convert 900 to 912 into
> the correct URLs, and pausing in between each request.
>
Wrap your script inside a counted for-loop:

for N in $(seq 900 912); do
    # variable N contains the right number here
    echo "$N"
done

Actually, I suppose you use some unix environment, where you have a
powerful collection of external tools (grep, seq) available and amazing
shell scripting abilities (like colons and loops).

-- Petr
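Since Jake is on Windows with Python, the counted for-loop above translates directly. This sketch only builds the URLs for a page range; `page_urls` is a name invented here, the 30-posts-per-page stride is inferred from start=27330 mapping to page 912, and the actual download plus the polite pause (as wget's -w 15 would do) is left as a comment:

```python
# Python equivalent of the `for N in $(seq 900 912)` shell loop above:
# build the thread URL for each page number. The 30-posts-per-page
# stride is an inference from start=27330 being page 912.
BASE = ('http://forums.sijun.com/viewtopic.php'
        '?t=29807&postdays=0&postorder=asc&start=%d')

def page_urls(first, last, posts_per_page=30):
    """URLs for pages first..last inclusive, like `seq first last`."""
    return [BASE % ((page - 1) * posts_per_page)
            for page in range(first, last + 1)]

for url in page_urls(900, 902):
    # download here (urllib, or subprocess calling wget), then pause
    # between requests as wget's -w 15 would: time.sleep(15)
    print(url)
```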