Re: wget - tracking urls/web crawling

2006-06-22 Thread Frank McCown
bruce wrote: hi... I'm testing wget on a test site. I'm using the recursive function of wget to crawl through a portion of the site. It appears that wget is hitting a link within the crawl that's causing it to begin to crawl through that section of the site again... I know wget isn't as
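A hedged sketch of the kinds of constraints that usually keep a recursive crawl from looping back into sections it has already visited (the host, depth and excluded directory below are hypothetical):

  # limit recursion depth and never ascend above the starting directory
  wget -r -l 3 -np http://example.com/section/
  # optionally exclude the directory whose links trigger the re-crawl
  wget -r -l 3 -np -X /section/archive http://example.com/section/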

Limit file size

2006-06-20 Thread Frank McCown
I would like to crawl several websites and limit the number of bytes per downloaded file to 5 MB, just in case I run into some files that are really large. From what I understand after reading through the wget manual, the --quota option could be used to limit the total number of bytes
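For reference, --quota caps the combined size of a recursive retrieval rather than any single file, and it is only checked after each file finishes, so it will not cut off an oversized download already in progress. A sketch (the URL is hypothetical):

  # stop the recursive crawl once roughly 5 MB have been fetched in total
  wget -r --quota=5m http://example.com/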

Warrick to reconstruct lost websites

2006-06-16 Thread Frank McCown
Some of you may be interested to learn about Warrick, a tool for reconstructing lost websites from the Internet Archive, Google, MSN, and Yahoo: http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html Warrick operates similarly to Wget, using many of the same parameters. I call Warrick a

Re: Background image not uploaded by wget

2006-06-01 Thread Frank McCown
Support for CSS has been on the wish list for some time. I don't think anyone is working on a patch right now. Frank Equipe web wrote: Hello, here is another bug that would be nice to correct in Wget: some background images are not imported. For example, take a look at this piece of
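A quick way to see the limitation being described (the URL is hypothetical): images that a page pulls in only through a stylesheet's background-image rule are not treated as page requisites, so they are skipped.

  # fetches the page, its inline images and linked stylesheets,
  # but not images referenced only from inside the CSS
  wget -p -k http://example.com/page-with-css-background.html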

Re: Bug report

2006-04-01 Thread Frank McCown
Gary Reysa wrote: Hi, I don't really know if this is a Wget bug, or some problem with my website, but, either way, maybe you can help. I have a web site ( www.BuildItSolar.com ) with perhaps a few hundred pages (260MB of storage total). Someone did a Wget on my site, and managed to log

Re: Download all the necessary files and linked images

2006-03-09 Thread Frank McCown
Jean-Marc MOLINA wrote: Hello, I want to archive an HTML page and « all the files that are necessary to properly display » it (Wget manual), plus all the linked images (<a href="linked_image_url"><img src="inlined_image_url"></a>). I tried most options and features: recursive archiving, including and
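A hedged sketch of the usual combination for this request (the URL is hypothetical): -p collects the files needed to display the page, while a depth-1 recursion also follows the links themselves, so directly linked images come along too. Add -H if those linked images live on another host.

  # page requisites plus one level of recursion for directly linked files
  wget -p -k -r -l 1 http://example.com/page.html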

Re: problem with downloading when HREF has ../

2006-02-26 Thread Frank McCown
Vladimir Volovich wrote: "DV" == Dmitry Vereschaka writes: suppose that I run wget -r -l 1 http://some-host.com/index.html and index.html contains a link like this: <A HREF="../directory/file.html">file</A> DV> URL ../directory/file.html placed in DV>
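For reference, a hedged worked resolution of that link against the page it appears on, following RFC 3986 dot-segment removal (the Handling of .. in url threads below cover the same ground): the base is http://some-host.com/index.html, so ../directory/file.html first becomes /../directory/file.html; the leading .., having nothing above the root to remove, is dropped, giving http://some-host.com/directory/file.html.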

Suggestion for documentation

2006-02-17 Thread Frank McCown
It may be useful to add a paragraph to the manual which lets users know they can use the --debug option to see why certain URLs are not followed (rejected) by wget. It would be especially useful to mention this in 9.1 Robot Exclusion. Something like this: If you wish to see which URLs are
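A minimal sketch of the invocation being suggested (the site URL is hypothetical); the debug output is verbose, so sending it to a log file keeps it manageable:

  # -d/--debug explains why each candidate URL was followed or rejected,
  # including decisions driven by robots.txt
  wget -r -d -o crawl-debug.log http://example.com/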

Re: Not parsed A tags

2006-02-09 Thread Frank McCown
Lauras chaosas wrote: When using --recursive for site mirroring, HTML pages are parsed. Anchors like <a href=...> are parsed OK, but when some attributes are specified between a and href=..., these tags are not parsed, e.g. <a class=... href=...>. Thanks for a useful program, good luck.

Case insensitive enhancement

2006-01-05 Thread Frank McCown
I'd like to suggest an enhancement that would help people who are downloading web sites housed on a Windows server. (I couldn't find any discussion of this in the email list archive or any mention in the on-line documentation.) Since Windows has a case-insensitive file system, Apache and IIS
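A hedged illustration of the duplication being described (the host and paths are hypothetical); on a Windows-hosted site both spellings serve the same file, but wget treats them as distinct URLs and saves two copies:

  # during a recursive crawl, the site links the same page two ways:
  #   http://example.com/Docs/Page.html
  #   http://example.com/docs/page.html
  # wget fetches and stores both as separate local files, e.g. when running:
  wget -r http://example.com/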

Re: Handling of .. in url

2005-12-02 Thread Frank McCown
Hrvoje Niksic wrote: Frank McCown [EMAIL PROTECTED] writes: Earlier today I sent an email explaining that wget already handles .. in the middle of a URL correctly; it just doesn't handle .. immediately after the domain name correctly. But it does, at least according to RFC 1808, which

Handling of .. in url

2005-12-01 Thread Frank McCown
Apache does not allow a URL to attempt access above the public_html location. Example: http://www.gnu.org/../software/wget/manual/wget.html will cause a Bad Request page to be generated because of the .. in the URL. But IIS does not handle .. the same way. IIS will simply ignore .. and

Re: Handling of .. in url

2005-12-01 Thread Frank McCown
Hrvoje Niksic wrote: Frank McCown [EMAIL PROTECTED] writes: But IIS does not handle .. the same way. IIS will simply ignore .. and produce the page. So the following two URLs are referencing the same HTML page: http://www.merseyfire.gov.uk/pages/fire_auth/councillors.htm and http

Re: Limit time to run

2005-11-30 Thread Frank McCown
Frank McCown wrote: It would be great if wget had a way of limiting the amount of time it took to run so it won't accidentally hammer on someone's web server for an indefinite amount of time. I often need to let a crawler run for a while on an unknown site, and I have to manually kill wget

Re: Limit time to run

2005-11-30 Thread Frank McCown
-Original Message- From: Mauro Tortonesi [mailto:[EMAIL PROTECTED]] Sent: Wednesday, November 30, 2005 12:02 PM To: Frank McCown Cc: wget@sunsite.dk Subject: Re: Limit time to run Frank McCown wrote: It would be great if wget had a way of limiting the amount of time it took to run so it won't

Re: Limit time to run

2005-11-30 Thread Frank McCown
From what I understand, killing wget processes may result in resource leaks. Really? What kind of resource leaks are you referring to? Wget does not create temporary files, nor does it allocate external resources other than dynamically allocated memory and network connections, both of which

Limit time to run

2005-11-29 Thread Frank McCown
It would be great if wget had a way of limiting the amount of time it took to run so it won't accidentally hammer on someone's web server for an indefinite amount of time. I often need to let a crawler run for a while on an unknown site, and I have to manually kill wget after a few hours
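In the absence of a built-in time limit, a hedged workaround sketch (the URL and two-hour limit are hypothetical); as discussed elsewhere in this thread, killing wget leaks nothing beyond dynamically allocated memory and open network connections:

  # run the crawl in the background and stop it after at most two hours
  wget -r http://example.com/ -o crawl.log &
  WGET_PID=$!
  sleep 7200
  kill $WGET_PID 2>/dev/null   # only matters if the crawl is still running

On systems that ship GNU coreutils, timeout 2h wget -r http://example.com/ collapses the same idea into one line.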

Re: Test a websites availability

2005-08-19 Thread Frank McCown
... Is it possible to use wget to test a website for being up only? I don't want to download any files, just test the site for availability. Any ideas? Pseudo code: wget www.google.com. Is connected? YES/NO. If YES, write answer to a log. If NO, write to a log and timestamp it. Thanks -- Frank McCown
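A minimal sketch of one way to do this with options wget already has (--spider asks for the page without saving it; the log file name is hypothetical):

  # wget exits with status 0 when the site answers, non-zero otherwise
  if wget --spider -q http://www.google.com/; then
      echo "$(date): up" >> availability.log
  else
      echo "$(date): DOWN" >> availability.log
  fi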

Re: robots.txt takes precedence over -p

2005-08-09 Thread Frank McCown
is not an absolute requirement, it is considered polite. I would not want the default behavior of wget to be considered impolite. IMVHO, Hrvoje has a good point when he says that wget behaves like a web browser and, as such, should not be required to respect the robots standard. -- Frank McCown
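For reference, the behaviour being debated can already be chosen per run rather than by default; a hedged sketch (the URL is hypothetical):

  # -e robots=off disables robots.txt processing for this invocation only,
  # so -p can fetch requisites that the robots rules would otherwise block
  wget -p -k -e robots=off http://example.com/page.html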

Re: Question

2005-08-09 Thread Frank McCown
... That means all the images, tables, can be saved as a single file. It is called Web Archive, single file (*.mht). Is it possible for wget? Not at the moment, but it's a planned feature for wget 2.0. Really? I've never heard of a .mht web archive; it seems a Windows-only thing. -- Frank McCown Old

Option to limit number of files downloaded

2005-07-19 Thread Frank McCown
It would be nice to have an option to limit the number of files downloaded. This can be very useful when doing a wget on a site that produces dynamic web pages. Sometimes these sites could have a scripting error that produces an infinite number of pages that wget will download. Left
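The requested per-file-count option does not exist, but a hedged sketch of the existing knobs that at least bound a runaway recursive crawl (the URL and limits are hypothetical):

  # cap recursion depth and total bytes so an endless page generator cannot run forever
  wget -r -l 5 --quota=100m http://example.com/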

Re: wget a file with long path on Windows XP

2005-07-13 Thread Frank McCown
After logging in, the website becomes similar to booksonline.com, which I edited slightly. My public library's electronic access also requires logging in. --- Frank McCown [EMAIL PROTECTED] wrote: Putting quotes around the URL got rid of your Invalid parameter errors. I just tried

Re: wget a file with long path on Windows XP

2005-07-12 Thread Frank McCown
--- Frank McCown [EMAIL PROTECTED] wrote: I think you need to put quotes around the URL. PoWah Wong wrote: The file I want to get is http://proquest.booksonline.com/JVXSL.asp?x=1mode=sectionsortKey=ranksortOrder=descview=bookxmlid=0-321-16076-2/ch02g=srchText=object+orientedcode=h=m
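The underlying problem is that an unquoted ? or & is interpreted by the shell before wget ever sees the URL, so the whole URL has to be quoted; a hedged sketch with a placeholder address (on Windows cmd.exe, use double quotes rather than single quotes):

  # quoting keeps ?, & and = from being interpreted by the shell
  wget 'http://example.com/page.asp?x=1&mode=section&view=book'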

Stopping index.html?N=D etc. from being stored

2005-06-29 Thread Frank McCown

Prereq doesn't download image in stylesheet

2005-06-14 Thread Frank McCown