Re: [Bug-wget] HTTP quota bug

2009-04-30 Thread Micah Cowan

Andrey Semenchuk wrote:
> Hi!
> 
> As described in the documentation, when the --quota option is used,
> "download will be aborted when the quota is exceeded". But unlike the
> FTP code, the HTTP code has no corresponding lines to break the
> download. So if a file is downloaded via HTTP, it will be fully
> downloaded and stored (no matter whether the --quota option is used or
> not), only with an additional warning when the quota is exceeded:
> "Download quota (... bytes) EXCEEDED!"

What documentation are you talking about? This is what I see:

`-Q QUOTA'
`--quota=QUOTA'
 Specify download quota for automatic retrievals.  The value can be
 specified in bytes (default), kilobytes (with `k' suffix), or
 megabytes (with `m' suffix).

 Note that quota will never affect downloading a single file.  So
 if you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz',
 all of the `ls-lR.gz' will be downloaded.  The same goes even when
 several URLs are specified on the command-line.  However, quota is
 respected when retrieving either recursively, or from an input
 file.  Thus you may safely type `wget -Q2m -i sites'--download
 will be aborted when the quota is exceeded.

 Setting quota to 0 or to `inf' unlimits the download quota.

Which is exactly the case, whether you're talking FTP or HTTP. It
doesn't break the download in the middle of a file.
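
For example (hypothetical input file urls.txt listing several HTTP
URLs), the quota is only checked between files, so the file that pushes
the total over the limit is still completed before wget stops starting
new downloads:

  wget -Q10k -i urls.txt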

Which, yeah, I agree is counter-intuitive. But with a program like wget,
I can never be sure that changing this won't break someone's script
somewhere. Not that we shouldn't do proper quotas, but we most likely
need to add that feature as a different option.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/




[Bug-wget] HTTP quota bug

2009-04-30 Thread Andrey Semenchuk

Hi!

As described in the documentation, when the --quota option is used,
"download will be aborted when the quota is exceeded". But unlike the
FTP code, the HTTP code has no corresponding lines to break the
download. So if a file is downloaded via HTTP, it will be fully
downloaded and stored (no matter whether the --quota option is used or
not), only with an additional warning when the quota is exceeded:
"Download quota (... bytes) EXCEEDED!"


--
Best wishes,
Andrey Semenchuk
Trifle Co., Ltd.




Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Petr Pisar
On Thu, Apr 30, 2009 at 03:31:21AM -0500, Jake b wrote:
> On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar  wrote:
> >
> > On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> but I'm not sure how to tell wget what the output html file should be named.
> 
wget -O OUTPUT_FILE_NAME
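
For instance (using the page number you already compute), something
along these lines should give the file the name you want:

  wget -O 912.html 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'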

> > > How do I make wget download all images on the page? I don't want to
> > > recurse other hosts, or even sijun, just download this page, and all
> > > images needed to display it.
> > >
> > That's not an easy task, especially because all the big desktop
> > images are stored on other servers. I think wget is not powerful
> > enough to do it all on its own.
> 
> Are you saying that because some services show a thumbnail, you then
> click through for the full image?
[…]
> Would it be simpler to say something like: download page 912,
> recursion level=1 (or 2?), except for non-image links (so it only
> allows recursion on images, i.e. downloading "randomguyshost.com/3.png")?
> 
You can limit downloads according to file name extensions (option -A);
however, this will also remove the main HTML file itself and prevent the
recursion. And no, there is no option to download only the files
referenced from a particular HTML element such as IMG.

Without the -A option, you get a lot of useless files (regardless of
spanning).
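
As a sketch (the accept list is illustrative, adjust it to taste),
extension filtering looks like this; note that the HTML page itself is
downloaded for parsing but deleted afterwards because it does not match
the accept list:

wget -r -l 1 -H -A jpg,jpeg,png 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'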

If you look at the locations of the files you are interested in, you
will see that they are all located outside the Sijun domain, and every
page contains only a small number of such files. Thus it's more
efficient, and friendlier to the servers, to extract these URLs first
and then download only them.

> But the problem is that it does not span any hosts? Is there a way I
> can achieve this if I do the same, except allow spanning to everybody,
> recurse at level 1, and recurse only into image links?
>
There is the -H option for spanning. The following wget-only command
does what you want, but as I said it produces a lot of useless requests
and files:

wget -p -l 1 -H 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'


> > I propose using other tools to extract the image URLs and then to
> > download them using wget. E.g.:
> 
> I guess I could use wget to get the html and parse that for image
> tags manually, but then I don't get the forum thread comments, which
> aren't required but would be nice.

You can do both: extract image URLs and extract comments.
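
As a rough sketch (hypothetical output name thread.html), save the page
once so you keep the comments, then feed the extracted image URLs to a
second wget:

wget -O thread.html 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'
grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' thread.html | wget -i -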

> 
> > wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330' \
> >   | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -
> >
> Ok, will have to try it out. ( In windows ATM so I can't pipe. )
> 
AFAIK the Windows shells command.com and cmd.exe support pipes.

> Using python, and I have dual boot if needed.
> 
Or you can run the programs connected through pipes from Python itself
(e.g. with the subprocess module).

-- Petr




Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Jake b
On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar  wrote:
>
> On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> > Instead of creating something like: "912.html" or "index.html" it instead
> > becomes: "viewtopic@t=29807&postdays=0&postorder=asc&start=27330"
> >
> That's normal because the server doesn't provide any useful
> alternative name via HTTP headers, which wget could otherwise pick up
> with its --content-disposition option.

I already know how to get the page number (my Python script converts
27330 to 912 and back), but I'm not sure how to tell wget what the
output html file should be named.

> > How do I make wget download all images on the page? I don't want to
> > recurse other hosts, or even sijun, just download this page, and all
> > images needed to display it.
> >
> That's not an easy task, especially because all the big desktop
> images are stored on other servers. I think wget is not powerful
> enough to do it all on its own.

Are you saying that because some services show a thumbnail, you then
click through for the full image? I'm not worried about that, since the
majority are full size in the thread.

Would it be simpler to say something like: download page 912,
recursion level=1 (or 2?), except for non-image links (so it only
allows recursion on images, i.e. downloading "randomguyshost.com/3.png")?

But the problem is that it does not span any hosts? Is there a way I
can achieve this if I do the same, except allow spanning to everybody,
recurse at level 1, and recurse only into image links?

> I propose using other tools to extract the image URLs and then to
> download them using wget. E.g.:

I guess I could use wget to get the html and parse that for image
tags manually, but then I don't get the forum thread comments, which
aren't required but would be nice.

> wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330' \
>   | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -
>
Ok, will have to try it out. ( In windows ATM so I can't pipe. )

> Actually, I supposed you were using some Unix environment, where you
> have a powerful collection of external tools (grep, seq) and amazing
> shell scripting abilities (like colons and loops) available.
>
> -- Petr

Using python, and I have dual boot if needed.

--
Jake




Re: [Bug-wget] download page-requisites with spanning hosts

2009-04-30 Thread Petr Pisar
On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> 
> The wGet command I am using:
> wget.exe -p -k -w 15
> "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330";
> 
> It has 2 problems:
> 
> 1) Rename file:
> 
> Instead of creating something like: "912.html" or "index.html" it instead
> becomes: "viewtopic@t=29807&postdays=0&postorder=asc&start=27330"
>
That's normal because the server doesn't provide any useful alternative
name via HTTP headers, which wget could otherwise pick up with its
--content-disposition option.

If you want to get the number of the gallery page, you need to parse
the HTML code by hand to obtain it (e.g. using grep).

However, I guess a better naming convention is the value of the "start"
URL parameter (in your example, the number 27330).
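
For example (a sketch, with a hypothetical shell variable START holding
that value), you can bake the parameter into the output name yourself:

START=27330
wget -O "page_$START.html" "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=$START"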

> 2) images that span hosts are failing.
> 
> I have page-requisites on, but since some pages are on tinypic, or
> imageshack, etc. it is not downloading them. Meaning it looks like
> this:
> 
> sijun/page912.php
>   imageshack.com/1.png
>   tinypic.com/2.png
>   randomguyshost.com/3.png
> 
> 
> Because of this, I cannot simply list all domains to span. I don't
> know all the domains, since people have personal servers.
> 
> How do I make wget download all images on the page? I don't want to
> recurse other hosts, or even sijun, just download this page, and all
> images needed to display it.
> 
That's not an easy task, especially because all the big desktop
images are stored on other servers. I think wget is not powerful
enough to do it all on its own.

I propose using other tools to extract the image URLs and then to
download them using wget. E.g.:

wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330' \
  | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -

This command downloads the HTML code, uses grep to find all the image
files stored on other servers (deciding by file name extensions and
absolute addresses), and finally downloads those images.

There is one little problem: not all of the images still exist, and
some servers return a dummy page instead of a proper error code, so you
can sometimes get non-image files.

> [ This one is a lower priority, but someone might already know how to
> solve this ]
> 3) After this is done, I want to loop to download multiple pages. It
> would be cool if I downloaded pages 900 to 912, and each page's "next"
> link worked correctly, linking to the local versions.
> 
[…]
> Either way, I have a simple script that can convert 900 to 912 into
> the correct URLs, pausing in between each request.
> 
Wrap your script inside a counted for-loop:

for N in $(seq 900 912); do
    # the variable N holds the current page number here
    echo "$N"
done
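
Pulling the pieces together (a sketch only; the 30-posts-per-page
offset is an assumption, inferred from the 912/27330 pair above), the
whole thing could look like:

for N in $(seq 900 912); do
    START=$(( (N - 1) * 30 ))    # page number -> "start" URL parameter
    wget -O "page_$N.html" "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=$START"
    sleep 15                     # be polite to the server, as with -w 15
done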

Actually, I supposed you were using some Unix environment, where you
have a powerful collection of external tools (grep, seq) and amazing
shell scripting abilities (like colons and loops) available.

-- Petr

