Re: [SLUG] Spider a website

2008-06-02 Thread James Polley
wget-smubble-yew-get. Wget works great for getting a single file or a very
simple all-under-this-tree setup, but it can take forever.

Try httrack - http://www.httrack.com/. Ignore the pretty little screenshots;
the Linux command-line version does the same job, it just takes a bit more
command-line-fu. It handles simple JavaScript links, is intelligent about
fetching requisites (images, CSS etc.) from off-domain without trying to
cache the whole internet, is multi-threaded - and is actually designed
specifically for the purpose of making a static, offline copy of a website.

The user's guide at http://www.httrack.com/html/fcguide.html goes through
most of the common scenarios for you, and $DISTRO should be able to apt-get
install it for you - or whatever tool distros unfortunate enough not to have
apt-get use instead.
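
Something along these lines (the URL and output path are just placeholders)
is the basic shape of a mirror run:

  httrack "http://www.example.com/" -O /tmp/mysite "+*.example.com/*" -v

-O picks the output directory, the "+..." pattern tells it which links it is
allowed to follow, and -v is just verbose output.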

On Tue, Jun 3, 2008 at 2:20 PM, Peter Rundle <[EMAIL PROTECTED]>
wrote:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files, images,
> js, css etc.
>
> I have a site written in PHP which needs to be hosted temporarily on a
> server which is incapable (read only does static content). This is not a
> problem from a temp presentation point of view as the default values for
> each page will suffice. So I'm just looking for a tool which will quickly
> pull the real site (on my home php capable server) into a directory that I
> can zip and send to the internet addressable server.
>
> I know there's a lot of code out there, I'm asking for recommendations.
>
> TIA's
>
> Pete
>
> --
> SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
> Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
>
>


-- 
There is nothing more worthy of contempt than a man who quotes himself -
Zhasper, 2004
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Ycros

On 03/06/2008, at 3:19 PM, Mary Gardiner wrote:

> On Tue, Jun 03, 2008, Ycros wrote:
>> It's not always perfect however, as it can sometimes mess the URLs up,
>> but it's worth a try anyway.
>
> The -k option to convert any absolute paths to relative ones can be
> helpful with this (depending on what you meant by "mess the URLs up").

I think it was URLs in stylesheets and in JavaScript (well, there's not
much you can do about the JavaScript really).

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Mary Gardiner
On Tue, Jun 03, 2008, Ycros wrote:
> It's not always perfect however, as it can sometimes mess the URLs up,  
> but it's worth a try anyway.

The -k option to convert any absolute paths to relative ones can be
helpful with this (depending on what you meant by "mess the URLs up").
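
Something like this (placeholder URL, obviously) does a recursive grab with
the link conversion:

  wget -r -k http://your.home.server/

-k rewrites links in the fetched pages so they point at the local copies
rather than back at the original server.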

-Mary
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Daniel Pittman
Peter Rundle <[EMAIL PROTECTED]> writes:

> I'm looking for some recommendations for a *simple* Linux based tool
> to spider a web site and pull the content back into plain html files,
> images, js, css etc.

Others have suggested wget, which works very well.  You might also
consider 'puf':

Package: puf
Priority: optional
Section: universe/web
Description: Parallel URL fetcher
 puf is a download tool for UNIX-like systems. You may use it to download
 single files or to mirror entire servers. It is similar to GNU wget
 (and has a partly compatible command line), but has the ability to do
 many downloads in parallel. This is very interesting, if you have a
 high-bandwidth internet connection.

This works quite well when, as it notes, presented with sufficient
bandwidth (and server resources) to have multiple links fetched at once.
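
As a rough sketch (placeholder URLs; check the man page for the exact
options your version supports):

  apt-get install puf
  puf http://example.com/a.iso http://example.com/b.iso

Giving it several URLs at once is the point: they get fetched in parallel
rather than one after another.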

Regards,
Daniel
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Richard Heycock
Excerpts from Peter Rundle's message of Tue Jun 03 14:20:08 +1000 2008:
> I'm looking for some recommendations for a *simple* Linux based tool to spider
> a web site and pull the content back into 
> plain html files, images, js, css etc.
> 
> I have a site written in PHP which needs to be hosted temporarily on a server
> which is incapable (read only does static 
> content). This is not a problem from a temp presentation point of view as the
> default values for each page will suffice. 
> So I'm just looking for a tool which will quickly pull the real site (on my
> home php capable server) into a directory 
> that I can zip and send to the internet addressable server.
> 
> I know there's a lot of code out there, I'm asking for recommendations.

wget can do that. Use the recurse option.
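
i.e. something like (placeholder URL):

  wget -r http://your.home.server/

which drops the whole tree into a directory named after the host.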

rgh

> TIA's
> 
> Pete
> 

-- 
+61 (0) 410 646 369
[EMAIL PROTECTED]

You're worried criminals will continue to penetrate into cyberspace, and
I'm worried complexity, poor design and mismanagement will be there to meet
them - Marcus Ranum
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Ycros
You could use wget to do this; it's installed on most distributions by
default.


Usually you'd run it like this: wget --mirror -np http://some.url/
(the -np tells it not to recurse up to the parent, which is useful if  
you only want to mirror a subdirectory. I add it on out of habit.)


It's not always perfect however, as it can sometimes mess the URLs up,  
but it's worth a try anyway.


On 03/06/2008, at 2:20 PM, Peter Rundle wrote:

> I'm looking for some recommendations for a *simple* Linux based tool
> to spider a web site and pull the content back into plain html
> files, images, js, css etc.
>
> I have a site written in PHP which needs to be hosted temporarily on
> a server which is incapable (read only does static content). This is
> not a problem from a temp presentation point of view as the default
> values for each page will suffice. So I'm just looking for a tool
> which will quickly pull the real site (on my home php capable
> server) into a directory that I can zip and send to the internet
> addressable server.
>
> I know there's a lot of code out there, I'm asking for
> recommendations.
>
> TIA's
>
> Pete
>
> --
> SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
> Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Jonathan Lange
On Tue, Jun 3, 2008 at 2:20 PM, Peter Rundle <[EMAIL PROTECTED]> wrote:
> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files, images,
> js, css etc.
>
> I have a site written in PHP which needs to be hosted temporarily on a
> server which is incapable (read only does static content). This is not a
> problem from a temp presentation point of view as the default values for
> each page will suffice. So I'm just looking for a tool which will quickly
> pull the real site (on my home php capable server) into a directory that I
> can zip and send to the internet addressable server.
>
> I know there's a lot of code out there, I'm asking for recommendations.
>

I'd use 'wget'. From what you describe, 'wget -r' should be very close
to what you want. Consult the manpage for details about fiddling with
links etc.
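
As a rough sketch of that fiddling (placeholder URL; see the manpage for the
exact spellings in your wget version):

  wget -r -k -p -E http://your.home.server/

-k converts links for local browsing, -p grabs page requisites (images, CSS),
and -E appends .html to pages served as text/html, which helps when the
originals are .php.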

jml
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Robert Collins
On Tue, 2008-06-03 at 14:20 +1000, Peter Rundle wrote:
> I'm looking for some recommendations for a *simple* Linux based tool to 
> spider a web site and pull the content back into 
> plain html files, images, js, css etc.
> 
> I have a site written in PHP which needs to be hosted temporarily on a server 
> which is incapable (read only does static 
> content). This is not a problem from a temp presentation point of view as the 
> default values for each page will suffice. 
> So I'm just looking for a tool which will quickly pull the real site (on my 
> home php capable server) into a directory 
> that I can zip and send to the internet addressable server.
> 
> I know there's a lot of code out there, I'm asking for recommendations.

wget :)

-Rob
-- 
GPG key available at: .


-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html

[SLUG] Spider a website

2008-06-02 Thread Peter Rundle
I'm looking for some recommendations for a *simple* Linux based tool to spider a web site and pull the content back into 
plain html files, images, js, css etc.


I have a site written in PHP which needs to be hosted temporarily on a server which is incapable (read: it only does static 
content). This is not a problem from a temp presentation point of view as the default values for each page will suffice. 
So I'm just looking for a tool which will quickly pull the real site (on my home php capable server) into a directory 
that I can zip and send to the internet addressable server.


I know there's a lot of code out there, I'm asking for recommendations.

TIA's

Pete

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html