bruce wrote:
Hi...
I'm testing wget on a test site. I'm using the recursive function of wget
to crawl through a portion of the site.
It appears that wget is hitting a link within the crawl that's causing it to
begin crawling through the same section of the site again.
I know wget isn't as
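If a single link keeps pulling the crawl back into a section that has already
been fetched (a dynamically generated index, for example), a directory exclude
can fence that part off. A minimal sketch only; the host and directory name
are hypothetical:

  wget --recursive --exclude-directories=/archive http://example.com/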
I would like to crawl several websites and limit the total number of
bytes per downloaded file to 5 MB, just in case I run into some files
that are really large.
From what I understand after reading through the wget manual, the
--quota option could be used to limit the total number of bytes
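For reference, --quota (-Q) caps the total bytes of the whole retrieval rather
than the size of any one file, and the file that crosses the limit is still
completed; wget only stops fetching further files afterwards. A sketch with a
hypothetical host:

  wget --recursive --quota=5m http://example.com/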
Some of you may be interested to learn about Warrick, a tool for
reconstructing lost websites from the Internet Archive, Google, MSN, and
Yahoo:
http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html
Warrick operates similarly to Wget, using many of the same parameters. I
call Warrick a
Support for CSS has been on the wish list for some time. I don't think
anyone is working on a patch right now.
Frank
Equipe web wrote:
Hello,
Here is another bug that would be nice to correct in Wget: some
background images are not imported.
For example take a look at this piece of
Gary Reysa wrote:
Hi,
I don't really know if this is a Wget bug, or some problem with my
website, but, either way, maybe you can help.
I have a web site (www.BuildItSolar.com) with perhaps a few hundred
pages (260MB of storage total). Someone did a Wget on my site, and
managed to log
Jean-Marc MOLINA wrote:
Hello,
I want to archive an HTML page and "all the files that are necessary to
properly display" it (Wget manual), plus all the linked images (<a
href="linked_image_url"><img src="inlined_image_url"></a>). I tried most
options and features: recursive archiving, including and
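A sketch of options that may get close, assuming the page and the full-size
linked images live on the same host (the URL is hypothetical):
--page-requisites pulls the inline images and stylesheets, and adding one
level of recursion also follows the <a href> links to the linked images, at
the cost of also fetching any other pages linked from the page:

  wget --page-requisites --convert-links --recursive --level=1 http://example.com/page.html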
Vladimir Volovich wrote:
DV == Dmitry Vereschaka writes:
suppose that I run
wget -r -l 1 http://some-host.com/index.html
and index.html contains a link like this:
<A HREF="../directory/file.html">file</A>
DV URL ../directory/file.html placed in
DV
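For reference, resolving that relative link against the base URL
http://some-host.com/index.html gives http://some-host.com/../directory/file.html;
whether the leading .. is then dropped (giving
http://some-host.com/directory/file.html) or sent to the server as-is is
exactly the RFC 1808 question discussed further down in this thread.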
It may be useful to add a paragraph to the manual which lets users know
they can use the --debug option to see why certain URLs are not followed
(rejected) by wget. It would be especially useful to mention this in
9.1 Robot Exclusion. Something like this:
If you wish to see which URLs are
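A minimal sketch of that workflow; the host is hypothetical, and the exact
wording of wget's debug messages varies between versions, so the grep pattern
is only an approximation:

  wget --debug --recursive -o wget-debug.log http://example.com/
  grep -i "reject" wget-debug.log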
Lauras chaosas wrote:
When using --recursive for site mirroring, HTML pages are parsed. Anchors
like <a href="..."> are parsed OK, but when some attributes are specified
between the a and the href="...", these tags are not parsed, e.g.
<a class="..." href="...">.
Thanks for a useful program,
good luck.
I'd like to suggest an enhancement that would help people who are
downloading web sites housed on a Windows server. (I couldn't find any
discussion of this in the email list archive or any mention in the
on-line documentation.)
Since Windows has a case-insensitive file system, Apache and IIS
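One related knob worth checking, assuming a reasonably recent wget build:
--restrict-file-names accepts a lowercase mode, which folds the local file
names to lower case. It does not stop wget from requesting both spellings of
a URL, but it keeps the saved tree consistent on case-insensitive file
systems (the host is hypothetical):

  wget --recursive --restrict-file-names=lowercase http://example.com/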
Hrvoje Niksic wrote:
Frank McCown [EMAIL PROTECTED] writes:
Earlier today I sent an email explaining that wget already handles
.. in the middle of a URL correctly; it just doesn't handle ..
immediately after the domain name correctly.
But it does, at least according to rfc1808, which
Apache does not allow a URL to attempt access above the public_html
location. Example:
http://www.gnu.org/../software/wget/manual/wget.html
will cause a Bad Request page to be generated because of the .. in the
URL.
But IIS does not handle .. the same way. IIS will simply ignore ..
and
Hrvoje Niksic wrote:
Frank McCown [EMAIL PROTECTED] writes:
But IIS does not handle .. the same way. IIS will simply ignore
.. and produce the page. So the following two URLs are referencing
the same HTML page:
http://www.merseyfire.gov.uk/pages/fire_auth/councillors.htm
and
http:
Frank McCown wrote:
It would be great if wget had a way of limiting the amount of time it
took to run so it won't accidentally hammer on someone's web server
for an indefinite amount of time. I often need to let a crawler
run for a while on an unknown site, and I have to manually kill wget
-Original Message-
From: Mauro Tortonesi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 30, 2005 12:02 PM
To: Frank McCown
Cc: wget@sunsite.dk
Subject: Re: Limit time to run
Frank McCown wrote:
It would be great if wget had a way of limiting the amount of time it
took to run so it won't
From what I understand, killing wget processes may result in resource
leaks.
Really? What kind of resource leaks are you referring to? Wget does
not create temporary files, nor does it allocate external resources
other than dynamically allocated memory and network connections, both
of which
It would be great if wget had a way of limiting the amount of time it
took to run so it won't accidentally hammer on someone's web server for
an indefinite amount of time. I often need to let a crawler run
for a while on an unknown site, and I have to manually kill wget after a
few hours
...
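A workaround outside wget itself (a sketch, assuming GNU coreutils timeout is
available; the host is hypothetical) is to bound the whole run from the shell.
Note that wget's own --timeout only limits individual network operations, not
the total crawl time:

  timeout 2h wget --recursive --wait=1 http://example.com/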
Is it possible to use wget to test a website for being up only? I don't want to download any files, just test the site for availability. Any ideas?
Pseudo code: wget www.google.com. Is it connected? YES/NO. If YES, write the answer to a log. If NO, write to a log and timestamp it.
Thanks
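A sketch of one way to do this from a shell script, using wget's --spider mode
(which requests the page but does not save it); the log file name is
hypothetical:

  if wget --spider --quiet http://www.google.com/; then
      echo "$(date) UP" >> availability.log
  else
      echo "$(date) DOWN" >> availability.log
  fi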
--
Frank McCown
is not an
absolute requirement; it is considered polite. I would not want the
default behavior of wget to be considered impolite.
IMVHO, Hrvoje has a good point when he says that wget behaves like a web
browser and, as such, should not be required to respect the robots
standard.
--
Frank McCown
. That means all the
images and tables can be saved as a single file. It is called a Web Archive,
single file (*.mht).
Is this possible with wget?
not at the moment, but it's a planned feature for wget 2.0.
Really? I've never heard of a .mht web archive; it seems to be a
Windows-only thing.
--
Frank McCown
Old
It would be nice to have an option to limit the number of files
downloaded. This can be very useful when doing a wget on a site that
produces dynamic web pages. Sometimes these sites could have a
scripting error that produces an infinite number of pages that wget will
download. Left
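As far as I know, the wget of that era has no switch that counts downloaded
files; the closest built-in limits are the recursion depth and the byte quota,
so a partial workaround might look like this (the host is hypothetical):

  wget --recursive --level=5 --quota=100m http://example.com/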
After logging in, the website becomes similar to
booksonline.com, which I edit slightly.
My public library's electronic access also
requires logging in.
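For sites that sit behind a login form, one common approach (a sketch only;
the form URL and field names are hypothetical and depend entirely on the site)
is to log in once, save the session cookies, and reuse them for the real
download:

  wget --save-cookies cookies.txt --keep-session-cookies \
       --post-data="user=me&pass=secret" "http://example.com/login.asp"
  wget --load-cookies cookies.txt "http://example.com/protected/page.html"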
--- Frank McCown [EMAIL PROTECTED] wrote:
Putting quotes around the URL got rid of your
"Invalid parameter" errors.
I just tried
--- Frank McCown [EMAIL PROTECTED] wrote:
I think you need to put quotes around the URL.
PoWah Wong wrote:
The file I want to get is
http://proquest.booksonline.com/JVXSL.asp?x=1mode=sectionsortKey=ranksortOrder=descview=bookxmlid=0-321-16076-2/ch02g=srchText=object+orientedcode=h=m
.
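For context, the quotes matter because an unquoted & tells the shell to put
the command in the background and start a new one, so everything after the
first & never reaches wget; quoting the whole URL passes it through intact.
A sketch with a hypothetical URL:

  wget "http://example.com/page.asp?x=1&mode=section&view=book"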
--
Frank McCown
Old Dominion University
http://www.cs.odu.edu/~fmccown
.
Thanks,
Frank McCown
Old Dominion University
http://www.cs.odu.edu/~fmccown