Jin,

Is your intent just to get the list of URLs? If it is
just one website, you can crawl it with Nutch. Look at
the "Intranet: Running the Crawl" tutorial at
http://lucene.apache.org/nutch/tutorial8.html. Use a
high depth value, like 10. Once the crawl is complete,
you can extract all the URLs from the crawldb with the
nutch readdb command and grep.
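
As a rough sketch (the directory names, the depth
value, and the seed file are just examples, adjust
them for your setup):

  # crawl starting from a seed file, e.g. urls/seed.txt
  bin/nutch crawl urls -dir crawl -depth 10 -topN 1000
  # dump the crawldb to plain text
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  # the url is the first field of each dumped record
  grep '^http' crawldb-dump/part-00000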

If your intent is to crawl every page on a website,
the process is the same; just use a high depth value.
Once the crawl is complete, you can search and view
the crawled content through the search pages. The
tutorial describes the search setup.
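
If I remember the tutorial correctly, the search setup
roughly amounts to deploying the Nutch web app into
Tomcat and pointing it at your crawl directory (the
paths below are just examples):

  # deploy the nutch war as the Tomcat root web app
  cp nutch*.war $TOMCAT_HOME/webapps/ROOT.war
  # then, in the webapp's nutch-site.xml, set the
  # searcher.dir property to your crawl directory,
  # e.g. /path/to/crawl, and restart Tomcat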

I hope I understood your question correctly.

Bipin

--- Jin Yang <[EMAIL PROTECTED]> wrote:

> How do I generate the list of URLs for a website?
> Should we put them in one by one? Like this?
> 
> www.apache.org/1.html
> www.apache.org/2.html
> www.apache.org/3.html
> 
> Is there any tool or command that can do this?
> 
