Phil Lewis
Sun, 13 Jun 2004 20:08:05 -0700
That works for me! The command line you sent, that is. So, the user-agent is arbitrary? Can be anything?
Thanks very much for your help.

-----Original Message-----
From: Robert Pendell [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 13, 2004 5:36 PM
To: Phil Lewis
Subject: Re: Cannot WGet Google Search Page?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I figured it out. Just so you know, though: you are not doing anything wrong. This occurred even without any switches. I discovered that google must be blocking the user-agent that wget uses so that people can't retrieve the search pages. The way around it is to tell wget to use an alternate user-agent. You can use mine, or go to http://www.gemal.dk/browserspy/basic.html and look at the UserAgent line for the one your browser sends.

The following command line is the same as yours, except that it has the additional user-agent switch (-U). You can replace my user-agent string with yours.

"c:\program files\wget\wget" -r -N -t2 -l2 -E -e robots=off -awGet.log -T 200 -H -Priserless -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040613 Firefox/0.8.0+" http://www.google.com/search?q=riserless

Likewise, here is part of the log.

--- begin wget.log excerpt
--18:34:31--  http://www.google.com/search?q=riserless
           => `riserless/www.google.com/[EMAIL PROTECTED]'
Resolving www.google.com... 216.239.51.147, 216.239.51.99, 216.239.51.104
Connecting to www.google.com[216.239.51.147]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    0K .......... .....                                          8.40 KB/s

Last-modified header missing -- time-stamps turned off.
18:34:34 (8.40 KB/s) - `riserless/www.google.com/[EMAIL PROTECTED]' saved [15602]

--- end wget.log excerpt.

Hope this is helpful!

-- 
Robert Pendell
[EMAIL PROTECTED]

Phil Lewis wrote:
| Jens, thank you for your response!
| Here's my command line:
|
| "c:\program files\wget\wget" -r -N -t2 -l2 -E -e robots=off -awGet.log -T 200 -H -Priserless http://www.google.com/search?q=riserless
|
| I have tried the URL in single quotes, double quotes and no quotes with the
| same result: A 403 Forbidden error. The logfile is given below. Thank you
| for your help!
|
| --12:41:25--  http://www.google.com/search?q=riserless
|            => `riserless/www.google.com/[EMAIL PROTECTED]'
| Resolving www.google.com... 64.233.167.104, 64.233.167.99
| Connecting to www.google.com[64.233.167.104]:80... connected.
| HTTP request sent, awaiting response... 403 Forbidden
| 12:41:26 ERROR 403: Forbidden.
|
|
| FINISHED --12:41:26--
| Downloaded: 0 bytes in 0 files
|
| -----Original Message-----
| From: Jens Rösner [mailto:[EMAIL PROTECTED]
| Sent: Saturday, June 12, 2004 11:30 AM
| To: Phil Lewis
| Cc: [EMAIL PROTECTED]
| Subject: Re: Cannot WGet Google Search Page?
|
|
| Hi Phil!
|
| Without more info (wget's verbose or even debug output, full command
| line,...) I find it hard to tell what is happening.
| However, I have had very good success with wget and google. So, some
| hints:
| 1. protect the google URL by enclosing it in "
| 2. remember to span (and allow only certain) hosts, otherwise, wget will
| only download google pages
| And lastly -but you obviously did so- think about restricting the recursion
| depth.
|
| Hope that helps a bit
| Jens
|
| > I have been trying to wget several levels deep from a Google search page
| > (e.g., http://www.google.com/search?=deepwater+oil). But on the very
| > first page, wget returns a 403 Forbidden error and stops. Anyone know
| > how I can get around this?
| >
| > Regards, Phil
| > Philip E. Lewis, P.E.
| > [EMAIL PROTECTED]
| >
| >
|

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)

iD8DBQFAzNbrhdWaw0WCfXURApHEAJ9LndvbxVQi1kAdXR3JAeNbhABxhQCfX/P2
untN63xPYqJ+Swquh68wpHU=
=IjW1
-----END PGP SIGNATURE-----
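
The workaround described in this thread, sending a browser-like User-Agent header instead of the client's default one, is not specific to wget. As a rough illustration only, here is a minimal sketch of the same idea using Python's standard urllib rather than wget. The helper names build_request and fetch are hypothetical, the User-Agent string is the one quoted from Robert's message, and the 200-second timeout simply mirrors the -T 200 switch from the command line. Whether any particular server accepts a given string is an assumption, not something this sketch guarantees.

```python
import urllib.request

# User-Agent string quoted from Robert's message in this thread;
# any other browser-like string could be substituted here.
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) "
    "Gecko/20040613 Firefox/0.8.0+"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries an explicit User-Agent header.

    Hypothetical helper: it only builds the request object and does
    not touch the network.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=200):
    """Send the request and return (HTTP status, body bytes).

    This performs a real network request. A 4xx/5xx reply such as the
    403 seen in the thread is raised as urllib.error.HTTPError.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Calling fetch("http://www.google.com/search?q=riserless") would send the request with the header set. build_request can be inspected on its own, without a network connection, to confirm which header will be sent.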