Actually, I realized there's an easier way, using a single wget command.

First, make a text file (filelist.txt) with the addresses of all the results pages:
http://www.google.com/search?q=site%3Awww.snowbrasil.com%2Ffotos&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=zle&q=site:www.snowbrasil.com/fotos&start=10&sa=N
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=o6J&q=site:www.snowbrasil.com/fotos&start=20&sa=N
etc., up to start=570 (since there are 577 results).
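
If you'd rather not build that list by hand, a short shell loop can
generate it.  This is just a sketch: it drops the optional hl/client/rls
parameters from the example URLs above, which shouldn't be needed:

# generate one results-page URL per line, for start=0,10,...,570
for i in $(seq 0 10 570); do
  echo "http://www.google.com/search?q=site:www.snowbrasil.com/fotos&start=$i"
done > filelist.txt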

Then use this command (the backslashes continue it onto the next line;
the --exclude-domains list has to stay one comma-separated argument, so
don't put spaces or indentation inside it):

wget -r -l1 -UFirefox -H -erobots=off --wait 1 \
--exclude-domains=images.google.com,maps.google.com,news.google.com,\
mail.google.com,video.google.com,groups.google.com,books.google.com,\
scholar.google.com,finance.google.com,blogsearch.google.com,\
www.youtube.com,picasaweb.google.com,docs.google.com,sites.google.com,\
www.snowbrasil.com,translate.google.com \
--input-file=filelist.txt

All of your cache files will end up in a single subdirectory named after the IP 
address that hosted the cached files.  When I tested it, it was 74.125.45.104, 
but that may vary.  They are easy to identify, since they have "cache" in 
the filename and look similar to this:
[EMAIL PROTECTED]&hl=en&ct=clnk&cd=20&gl=us&ie=UTF-8&client=firefox-a
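
To check what you got, something like this should work (the IP-named
directory is from my test run and may differ on yours):

# list the downloaded cache pages under the IP-named directory
find 74.125.45.104 -type f -name '*cache*'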


________________________________
From: Yan Grossman <[EMAIL PROTECTED]>
To: Ben Smith <[EMAIL PROTECTED]>
Sent: Thursday, November 13, 2008 2:34:56 AM
Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls 
help]

Thanks so much for responding. Do I need to write a script with these commands, 
or do I run them one at a time on the command line on my server?
Would you please just tell me what the syntax is so I only download the cache 
files?
Thanks so much



On Wed, Nov 12, 2008 at 9:30 PM, Ben Smith <[EMAIL PROTECTED]> wrote:

grep is a command-line program that finds the lines in a text file that 
match a given pattern
more info/usage: http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?grep
sed is a command-line program (a stream editor) that can search for and 
replace text
more info/usage: http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?sed

Any Linux distro should have these, or if you're running Windows you can get 
them at:
http://gnuwin32.sourceforge.net/packages/grep.htm
http://gnuwin32.sourceforge.net/packages/sed.htm
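
As a toy example of how the two combine (the file name and patterns here
are made up, purely to show the plumbing):

# print the lines of somefile.txt that contain "pattern",
# with every occurrence of "old" rewritten to "new"
grep 'pattern' somefile.txt | sed 's/old/new/g'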





________________________________
From: Yan Grossman <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, November 12, 2008 2:03:58 PM
Subject: Fwd: [Fwd: Re: [Bug-wget] Fwd: Trying to download HTML from Google's 
Cache. Pls help]




---------- Forwarded message ----------
From: Yan Grossman <[EMAIL PROTECTED]>
Date: Wed, Nov 12, 2008 at 10:49 AM
Subject: Re: [Fwd: Re: [Bug-wget] Fwd: Trying to download HTML from Google's 
Cache. Pls help]
To: Micah Cowan <[EMAIL PROTECTED]>


Thanks so much. But what does "Then grep each of the results files to find 
the line with links to all the cached pages.  You can pipe that output into 
sed" mean?
I am not familiar with "grep" and "sed".

Could you please elaborate?

Thanks



On Wed, Nov 12, 2008 at 10:32 AM, Micah Cowan <[EMAIL PROTECTED]> wrote:

-------- Original Message --------
Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's
Cache. Pls help
Date: Wed, 12 Nov 2008 10:00:34 -0800 (PST)
From: Ben Smith <[EMAIL PROTECTED]>
To: Micah Cowan <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]>


Adding -UFirefox allows the download.  So you should first wget
-UFirefox all the listed results pages from Google:
http://www.google.com/search?q=site%3Awww.snowbrasil.com%2Ffotos&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=zle&q=site:www.snowbrasil.com/fotos&start=10&sa=N
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=o6J&q=site:www.snowbrasil.com/fotos&start=20&sa=N

etc., up to start=570 (since there are 577 results).

Then grep each of the results files to find the line with links to all
the cached pages.  You can pipe that output into sed, which you can use
to remove everything but the links to the cached pages (replace the info
before, after, and between the cache links with a space).  Then simply
pipe that to wget -UFirefox -i - (the -i - tells wget to read the list
of URLs from standard input), and you should get all your files.
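
A rough sketch of that pipeline, assuming the saved results pages are
named search* and that the cache links in them look like
http://<ip>/search?q=cache:... -- inspect one saved page first and
adjust the grep pattern to match what you actually see:

# -h drops filename prefixes, -o prints only the matching URLs;
# sed undoes the HTML-escaped &amp; before wget reads the list on stdin
grep -ho 'http://[^"]*q=cache:[^"]*' search* \
  | sed 's/&amp;/\&/g' \
  | wget -UFirefox --wait 1 -i -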




----- Original Message ----
> From: Micah Cowan <[EMAIL PROTECTED]>
> To: Ben Smith <[EMAIL PROTECTED]>
> Cc: [email protected]
> Sent: Tuesday, November 11, 2008 3:27:05 PM
> Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls 
> help
>
> Ben Smith wrote:
>
>> Subject: Re: [Bug-wget] Re: Bug-wget Digest, Vol 1, Issue 10
>
>>> When replying, please edit your Subject line so it is more specific
>>>  than "Re: Contents of Bug-wget digest..."
>
> It's helpful if you adhere to this guideline; otherwise it's hard to
> follow threads. (I've fixed the subject in my reply.)
>
>> It would theoretically be possible to use grep and sed to strip out
>> the links to the cached files and pipe that to wget.  However,
>> Google appears to block access to results pages and cached pages via
>> wget.  I tried to download several using wget and got a 403 Forbidden
>> response.
>
> http://wget.addictivecode.org/FrequentlyAskedQuestions#not-downloading
> should be helpful for such problems (using -U is the most applicable
> suggestion, but you may also run into the others). Please also consider
> adding --limit-rate or --wait.
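> For example (an illustrative invocation, not from the FAQ; the
> user-agent string, rate, and URL are placeholders):
>
>   wget -U "Mozilla/5.0" --wait=1 --limit-rate=50k http://example.com/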
>

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


      
