RE: Cannot WGet Google Search Page?

Phil Lewis
Sun, 13 Jun 2004 20:08:05 -0700

That works for me! The command line you sent, that is. So, the user-agent is
arbitrary? Can be anything?

Thanks very much for your help.

-----Original Message-----
From: Robert Pendell [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 13, 2004 5:36 PM
To: Phil Lewis
Subject: Re: Cannot WGet Google Search Page?


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I figured it out.  Just so you know, you are not doing anything wrong; this
occurred even without any switches.  I discovered that Google must be
blocking the user-agent that wget uses so that people can't retrieve the
search pages.  The way around it is to tell wget to use an alternate
user-agent.  You can use mine, or go to
http://www.gemal.dk/browserspy/basic.html and look at the UserAgent line for
the one your browser sends.  The following command line is the same as
yours, only with the additional user-agent switch (-U).  You can replace my
user-agent string with yours, though.

"c:\program files\wget\wget" -r -N -t2 -l2 -E -e robots=off -awGet.log -T 200 -H -Priserless -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040613 Firefox/0.8.0+" http://www.google.com/search?q=riserless

Likewise, here is part of the log.
--- begin wget.log excerpt
--18:34:31--  http://www.google.com/search?q=riserless
           => `riserless/www.google.com/[EMAIL PROTECTED]'
Resolving www.google.com... 216.239.51.147, 216.239.51.99, 216.239.51.104
Connecting to www.google.com[216.239.51.147]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    0K .......... .....                                        8.40 KB/s

Last-modified header missing -- time-stamps turned off.
18:34:34 (8.40 KB/s) - `riserless/www.google.com/[EMAIL PROTECTED]' saved [15602]
--- end wget.log excerpt
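
For anyone wondering what the -U switch changes: as far as I can tell it
only sets the User-Agent header of the HTTP request, and the string is free
text.  Here is a rough Python sketch of the same idea (this is not wget
itself, and build_request is a hypothetical helper name I made up for
illustration).  It builds the request without sending it, so the header can
be inspected offline.

```python
import urllib.request

# User-agent string copied from the command line in this message.
UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) "
      "Gecko/20040613 Firefox/0.8.0+")


def build_request(url, user_agent):
    """Hypothetical helper: return a Request carrying a custom User-Agent.

    Nothing is sent over the network here; the object is only constructed.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.google.com/search?q=riserless", UA)

# urllib stores the header name in capitalized form ("User-agent").
print(req.get_header("User-agent"))
```

Whether a given server accepts or rejects a particular string is up to that
server, so treat this as an illustration only.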

Hope this is helpful!
--
Robert Pendell
[EMAIL PROTECTED]

Phil Lewis wrote:
| Jens, thank you for your response! Here's my command line:
|
| "c:\program files\wget\wget" -r -N -t2 -l2 -E -e robots=off -awGet.log -T 200 -H -Priserless http://www.google.com/search?q=riserless
|
| I have tried the URL in single quotes, double quotes and no quotes with the
| same result: A 403 Forbidden error. The logfile is given below. Thank you
| for your help!
|
| --12:41:25--  http://www.google.com/search?q=riserless
|            => `riserless/www.google.com/[EMAIL PROTECTED]'
| Resolving www.google.com... 64.233.167.104, 64.233.167.99
| Connecting to www.google.com[64.233.167.104]:80... connected.
| HTTP request sent, awaiting response... 403 Forbidden
| 12:41:26 ERROR 403: Forbidden.
|
|
| FINISHED --12:41:26--
| Downloaded: 0 bytes in 0 files
|
| -----Original Message-----
| From: Jens Rösner [mailto:[EMAIL PROTECTED]
| Sent: Saturday, June 12, 2004 11:30 AM
| To: Phil Lewis
| Cc: [EMAIL PROTECTED]
| Subject: Re: Cannot WGet Google Search Page?
|
|
| Hi Phil!
|
| Without more info (wget's verbose or even debug output, full command
| line,...) I find it hard to tell what is happening.
| However, I have had very good success with wget and google. So, some
| hints:
| 1. protect the google URL by enclosing it in "
| 2. remember to span (and allow only certain) hosts, otherwise, wget will
|    only download google pages
| And lastly -but you obviously did so- think about restricting the recursion
| depth.
|
| Hope that helps a bit
| Jens
|
|> I have been trying to wget several levels deep from a Google search page
|>(e.g., http://www.google.com/search?=deepwater+oil). But on the very 
|>first page, wget returns a 403 Forbidden error and stops. Anyone know 
|>how I can get around this?
|>
|>Regards, Phil 
|>Philip E. Lewis, P.E.
|>[EMAIL PROTECTED]
|>
|>
|
|
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)

iD8DBQFAzNbrhdWaw0WCfXURApHEAJ9LndvbxVQi1kAdXR3JAeNbhABxhQCfX/P2
untN63xPYqJ+Swquh68wpHU=
=IjW1
-----END PGP SIGNATURE-----